INN Hotels Project¶

Context¶

A significant number of hotel bookings are called-off due to cancellations or no-shows. The typical reasons for cancellations include change of plans, scheduling conflicts, etc. This is often made easier by the option to do so free of charge or preferably at a low cost which is beneficial to hotel guests but it is a less desirable and possibly revenue-diminishing factor for hotels to deal with. Such losses are particularly high on last-minute cancellations.

The new technologies involving online booking channels have dramatically changed customers’ booking possibilities and behavior. This adds a further dimension to the challenge of how hotels handle cancellations, which are no longer limited to traditional booking and guest characteristics.

The cancellation of bookings impact a hotel on various fronts:

  • Loss of resources (revenue) when the hotel cannot resell the room.
  • Additional costs of distribution channels by increasing commissions or paying for publicity to help sell these rooms.
  • Lowering prices last minute, so the hotel can resell a room, resulting in reducing the profit margin.
  • Human resources to make arrangements for the guests.

Objective¶

The increasing number of cancellations calls for a Machine Learning based solution that can help in predicting which booking is likely to be canceled. INN Hotels Group has a chain of hotels in Portugal, they are facing problems with the high number of booking cancellations and have reached out to your firm for data-driven solutions. You as a data scientist have to analyze the data provided to find which factors have a high influence on booking cancellations, build a predictive model that can predict which booking is going to be canceled in advance, and help in formulating profitable policies for cancellations and refunds.

Data Description¶

The data contains the different attributes of customers' booking details. The detailed data dictionary is given below.

Data Dictionary

  • Booking_ID: unique identifier of each booking
  • no_of_adults: Number of adults
  • no_of_children: Number of Children
  • no_of_weekend_nights: Number of weekend nights (Saturday or Sunday) the guest stayed or booked to stay at the hotel
  • no_of_week_nights: Number of week nights (Monday to Friday) the guest stayed or booked to stay at the hotel
  • type_of_meal_plan: Type of meal plan booked by the customer:
    • Not Selected – No meal plan selected
    • Meal Plan 1 – Breakfast
    • Meal Plan 2 – Half board (breakfast and one other meal)
    • Meal Plan 3 – Full board (breakfast, lunch, and dinner)
  • required_car_parking_space: Does the customer require a car parking space? (0 - No, 1- Yes)
  • room_type_reserved: Type of room reserved by the customer. The values are ciphered (encoded) by INN Hotels.
  • lead_time: Number of days between the date of booking and the arrival date
  • arrival_year: Year of arrival date
  • arrival_month: Month of arrival date
  • arrival_date: Date of the month
  • market_segment_type: Market segment designation.
  • repeated_guest: Is the customer a repeated guest? (0 - No, 1- Yes)
  • no_of_previous_cancellations: Number of previous bookings that were canceled by the customer prior to the current booking
  • no_of_previous_bookings_not_canceled: Number of previous bookings not canceled by the customer prior to the current booking
  • avg_price_per_room: Average price per day of the reservation; prices of the rooms are dynamic. (in euros)
  • no_of_special_requests: Total number of special requests made by the customer (e.g. high floor, view from the room, etc)
  • booking_status: Flag indicating if the booking was canceled or not.

Problem Definition¶

Analyze the data of INN Hotels to find which factors have a high influence on booking cancellations which lead to revenue losses, reduced profit margins, inefficient resource allocation, and build a predictive model that can predict which booking is going to be cancelled in advance, and help in formulating profitable policies for cancellations and refunds.

Importing necessary libraries and data¶

In [160]:
# Installing the libraries with the specified version.
!pip install pandas==1.5.3 numpy==1.25.2 matplotlib==3.7.1 seaborn==0.13.1 scikit-learn==1.2.2 statsmodels==0.14.1 -q --user

Note: After running the above cell, kindly restart the notebook kernel and run all cells sequentially from the start again.

In [2]:
# Libraries to help with reading and manipulating data
import pandas as pd
import numpy as np

# libaries to help with data visualization
import matplotlib.pyplot as plt
import seaborn as sns

# Removes the limit for the number of displayed columns
pd.set_option("display.max_columns", None)
# Sets the limit for the number of displayed rows
pd.set_option("display.max_rows", 200)
# setting the precision of floating numbers to 5 decimal points
pd.set_option("display.float_format", lambda x: "%.5f" % x)

# Library to split data
from sklearn.model_selection import train_test_split

# To build model for prediction
import statsmodels.stats.api as sms
from statsmodels.stats.outliers_influence import variance_inflation_factor
import statsmodels.api as sm
from statsmodels.tools.tools import add_constant
from sklearn.tree import DecisionTreeClassifier
from sklearn import tree

# To tune different models
from sklearn.model_selection import GridSearchCV


# To get diferent metric scores
from sklearn.metrics import (
    f1_score,
    accuracy_score,
    recall_score,
    precision_score,
    confusion_matrix,
    roc_auc_score,
    precision_recall_curve,
    roc_curve,
    make_scorer,
)

import warnings
warnings.filterwarnings("ignore")

from statsmodels.tools.sm_exceptions import ConvergenceWarning
warnings.simplefilter("ignore", ConvergenceWarning)

Import Dataset¶

In [3]:
# uncomment and run the following lines for Google Colab
# from google.colab import drive
# drive.mount('/content/drive')
In [4]:
# Code to let colab access my google drive
from google.colab import drive
drive.mount('/content/drive')
Mounted at /content/drive
In [5]:
# Code to read dataset
hotel = pd.read_csv('/content/drive/My Drive/INNHotelsGroup.csv')
In [6]:
# copying data to another variable to avoid any changes to original data
data = hotel.copy()

Data Overview¶

  • Observations

  • Sanity checks

  • Observations to check if data has been uploaded properly or not. Here, I check the first five rows and the last 5 rows of the dataset

  • Sanity checks to get information about the number of rows and columns in the dataset, find out the data types of the columns to ensure that data is stored in the preferred format and the value of each property is as expected, check the statistical summary of the dataset to get an overview of the numerical columns of the data, check for duplicate values, and missing values.

In [7]:
# Code to view the first 5 rows of the dataset
data.head()
Out[7]:
Booking_ID no_of_adults no_of_children no_of_weekend_nights no_of_week_nights type_of_meal_plan required_car_parking_space room_type_reserved lead_time arrival_year arrival_month arrival_date market_segment_type repeated_guest no_of_previous_cancellations no_of_previous_bookings_not_canceled avg_price_per_room no_of_special_requests booking_status
0 INN00001 2 0 1 2 Meal Plan 1 0 Room_Type 1 224 2017 10 2 Offline 0 0 0 65.00000 0 Not_Canceled
1 INN00002 2 0 2 3 Not Selected 0 Room_Type 1 5 2018 11 6 Online 0 0 0 106.68000 1 Not_Canceled
2 INN00003 1 0 2 1 Meal Plan 1 0 Room_Type 1 1 2018 2 28 Online 0 0 0 60.00000 0 Canceled
3 INN00004 2 0 0 2 Meal Plan 1 0 Room_Type 1 211 2018 5 20 Online 0 0 0 100.00000 0 Canceled
4 INN00005 2 0 1 1 Not Selected 0 Room_Type 1 48 2018 4 11 Online 0 0 0 94.50000 0 Canceled
In [8]:
# Code to view the last 5 rows of the dataset
data.tail()
Out[8]:
Booking_ID no_of_adults no_of_children no_of_weekend_nights no_of_week_nights type_of_meal_plan required_car_parking_space room_type_reserved lead_time arrival_year arrival_month arrival_date market_segment_type repeated_guest no_of_previous_cancellations no_of_previous_bookings_not_canceled avg_price_per_room no_of_special_requests booking_status
36270 INN36271 3 0 2 6 Meal Plan 1 0 Room_Type 4 85 2018 8 3 Online 0 0 0 167.80000 1 Not_Canceled
36271 INN36272 2 0 1 3 Meal Plan 1 0 Room_Type 1 228 2018 10 17 Online 0 0 0 90.95000 2 Canceled
36272 INN36273 2 0 2 6 Meal Plan 1 0 Room_Type 1 148 2018 7 1 Online 0 0 0 98.39000 2 Not_Canceled
36273 INN36274 2 0 0 3 Not Selected 0 Room_Type 1 63 2018 4 21 Online 0 0 0 94.50000 0 Canceled
36274 INN36275 2 0 1 2 Meal Plan 1 0 Room_Type 1 207 2018 12 30 Offline 0 0 0 161.67000 0 Not_Canceled

Observations

The dataset has been uploaded properly, clearly identifying the columns and the rows in the dataset. We can proceed to check the shape of the data to know exactly how many columns and rows are there.

In [9]:
# Code to check the shape of the dataset
data.shape
print('There are', data.shape[0],'rows and', data.shape[1],'columns')
There are 36275 rows and 19 columns

Observations

The dataset has 36275 rows and 19 columns.

We can proceed to check the datatypes of the different columns in the dataset.

In [10]:
# Code to determine datatypes of different columns in the dataset
info = data.info()
print(info)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 36275 entries, 0 to 36274
Data columns (total 19 columns):
 #   Column                                Non-Null Count  Dtype  
---  ------                                --------------  -----  
 0   Booking_ID                            36275 non-null  object 
 1   no_of_adults                          36275 non-null  int64  
 2   no_of_children                        36275 non-null  int64  
 3   no_of_weekend_nights                  36275 non-null  int64  
 4   no_of_week_nights                     36275 non-null  int64  
 5   type_of_meal_plan                     36275 non-null  object 
 6   required_car_parking_space            36275 non-null  int64  
 7   room_type_reserved                    36275 non-null  object 
 8   lead_time                             36275 non-null  int64  
 9   arrival_year                          36275 non-null  int64  
 10  arrival_month                         36275 non-null  int64  
 11  arrival_date                          36275 non-null  int64  
 12  market_segment_type                   36275 non-null  object 
 13  repeated_guest                        36275 non-null  int64  
 14  no_of_previous_cancellations          36275 non-null  int64  
 15  no_of_previous_bookings_not_canceled  36275 non-null  int64  
 16  avg_price_per_room                    36275 non-null  float64
 17  no_of_special_requests                36275 non-null  int64  
 18  booking_status                        36275 non-null  object 
dtypes: float64(1), int64(13), object(5)
memory usage: 5.3+ MB
None

Observations

Of the 19 columns in the dataset, we have 1 float, 13 integers, and 5 objects (Booking ID, Type of Meal Plan, Room Type Reserved, Market Segment Type, and Booking Status are all strings). Memory usage is 5.3+ MB.

There appears to be no missing values in the entries. We can confirm this using the Python function data.isnull().sum() but first let's check the statistical summary of the dataset.

In [11]:
# Code to check the statistical summary of dataset
data.describe().T
Out[11]:
count mean std min 25% 50% 75% max
no_of_adults 36275.00000 1.84496 0.51871 0.00000 2.00000 2.00000 2.00000 4.00000
no_of_children 36275.00000 0.10528 0.40265 0.00000 0.00000 0.00000 0.00000 10.00000
no_of_weekend_nights 36275.00000 0.81072 0.87064 0.00000 0.00000 1.00000 2.00000 7.00000
no_of_week_nights 36275.00000 2.20430 1.41090 0.00000 1.00000 2.00000 3.00000 17.00000
required_car_parking_space 36275.00000 0.03099 0.17328 0.00000 0.00000 0.00000 0.00000 1.00000
lead_time 36275.00000 85.23256 85.93082 0.00000 17.00000 57.00000 126.00000 443.00000
arrival_year 36275.00000 2017.82043 0.38384 2017.00000 2018.00000 2018.00000 2018.00000 2018.00000
arrival_month 36275.00000 7.42365 3.06989 1.00000 5.00000 8.00000 10.00000 12.00000
arrival_date 36275.00000 15.59700 8.74045 1.00000 8.00000 16.00000 23.00000 31.00000
repeated_guest 36275.00000 0.02564 0.15805 0.00000 0.00000 0.00000 0.00000 1.00000
no_of_previous_cancellations 36275.00000 0.02335 0.36833 0.00000 0.00000 0.00000 0.00000 13.00000
no_of_previous_bookings_not_canceled 36275.00000 0.15341 1.75417 0.00000 0.00000 0.00000 0.00000 58.00000
avg_price_per_room 36275.00000 103.42354 35.08942 0.00000 80.30000 99.45000 120.00000 540.00000
no_of_special_requests 36275.00000 0.61966 0.78624 0.00000 0.00000 0.00000 1.00000 5.00000

Observations

  • Minimum Average price per room is 0.00 euros. The first quartile is 80.30 Euros. The third quartile is 120 euros. The maximum is 540 euros. The median is 99.45 euros.
  • Maximum number of previous cancellations is 13
  • Maximum number of previous bookings not cancelled is 58.
  • Average price per room is 103.42 euros.
  • Average number of adults in a room is 1.844, approximately 2. The median is 2. The maximum is 4.
  • Minimum lead time is 0.00, with maximum of 443. Median is 57 and mean is 85.9. There could be some skewness here. -Maximum number of children is 10. Average is less than 1. No number is detected for the first quartile, median, and third quartile. This could be an indication that most parents do not bring their children with them.
In [12]:
# Code to check missing values in the dataset
data.isnull().sum()
Out[12]:
0
Booking_ID 0
no_of_adults 0
no_of_children 0
no_of_weekend_nights 0
no_of_week_nights 0
type_of_meal_plan 0
required_car_parking_space 0
room_type_reserved 0
lead_time 0
arrival_year 0
arrival_month 0
arrival_date 0
market_segment_type 0
repeated_guest 0
no_of_previous_cancellations 0
no_of_previous_bookings_not_canceled 0
avg_price_per_room 0
no_of_special_requests 0
booking_status 0

Observations

There are no missing values in the dataset.

In [13]:
# Code to check for duplicates
data.duplicated().sum()
Out[13]:
0

Observations

There are no duplicated entries in the dataset.

With no missing values and duplicated entries, the data is ready for detailed analysis but first let's drop the Booking_ID column first before we proceed.

In [14]:
#Code to drop Booking-ID so we can proceed
data.drop('Booking_ID',axis=1,inplace=True)
In [15]:
data.head()
Out[15]:
no_of_adults no_of_children no_of_weekend_nights no_of_week_nights type_of_meal_plan required_car_parking_space room_type_reserved lead_time arrival_year arrival_month arrival_date market_segment_type repeated_guest no_of_previous_cancellations no_of_previous_bookings_not_canceled avg_price_per_room no_of_special_requests booking_status
0 2 0 1 2 Meal Plan 1 0 Room_Type 1 224 2017 10 2 Offline 0 0 0 65.00000 0 Not_Canceled
1 2 0 2 3 Not Selected 0 Room_Type 1 5 2018 11 6 Online 0 0 0 106.68000 1 Not_Canceled
2 1 0 2 1 Meal Plan 1 0 Room_Type 1 1 2018 2 28 Online 0 0 0 60.00000 0 Canceled
3 2 0 0 2 Meal Plan 1 0 Room_Type 1 211 2018 5 20 Online 0 0 0 100.00000 0 Canceled
4 2 0 1 1 Not Selected 0 Room_Type 1 48 2018 4 11 Online 0 0 0 94.50000 0 Canceled

Observations

Booking ID has been dropped.

Exploratory Data Analysis (EDA)¶

  • EDA is an important part of any project involving data.
  • It is important to investigate and understand the data better before building a model with it.
  • A few questions have been mentioned below which will help you approach the analysis in the right manner and generate insights from the data.
  • A thorough analysis of the data, in addition to the questions mentioned below, should be done.

Leading Questions:

  1. What are the busiest months in the hotel?
  2. Which market segment do most of the guests come from?
  3. Hotel rates are dynamic and change according to demand and customer demographics. What are the differences in room prices in different market segments?
  4. What percentage of bookings are canceled?
  5. Repeating guests are the guests who stay in the hotel often and are important to brand equity. What percentage of repeating guests cancel?
  6. Many guests have special requirements when booking a hotel room. Do these requirements affect booking cancellation?

Univariate Analysis¶

In [16]:
def histogram_boxplot(data, feature, figsize=(15, 10), kde=False, bins=None):
    """
    Boxplot and histogram combined

    data: dataframe
    feature: dataframe column
    figsize: size of figure (default (15,10))
    kde: whether to show the density curve (default False)
    bins: number of bins for histogram (default None)
    """
    f2, (ax_box2, ax_hist2) = plt.subplots(
        nrows=2,  # Number of rows of the subplot grid= 2
        sharex=True,  # x-axis will be shared among all subplots
        gridspec_kw={"height_ratios": (0.25, 0.75)},
        figsize=figsize,
    )  # creating the 2 subplots
    sns.boxplot(
        data=data, x=feature, ax=ax_box2, showmeans=True, color="violet"
    )  # boxplot will be created and a triangle will indicate the mean value of the column
    sns.histplot(
        data=data, x=feature, kde=kde, ax=ax_hist2, bins=bins
    ) if bins else sns.histplot(
        data=data, x=feature, kde=kde, ax=ax_hist2
    )  # For histogram
    ax_hist2.axvline(
        data[feature].mean(), color="green", linestyle="--"
    )  # Add mean to the histogram
    ax_hist2.axvline(
        data[feature].median(), color="black", linestyle="-"
    )  # Add median to the histogram
In [17]:
def labeled_barplot(data, feature, perc=False, n=None):
       """
       Barplot with percentage labels.

       data: dataframe
       feature: dataframe column
       perc: whether to display percentages instead of counts (default False)
       n: number of bars to display (default None, displays all bars)
       """
       total = len(data[feature])  # length of the column
       count = data[feature].nunique()
       if n is None:
           plt.figure(figsize=(count + 1, 5))
       else:
           plt.figure(figsize=(n + 1, 5))

       plt.xticks(rotation=90, fontsize=15)
       ax = sns.countplot(
           data=data,
           x=feature,
           order=data[feature].value_counts().index[:n].sort_values(),
           palette="Set2",
       )

       for p in ax.patches:
           if perc == True:
               label = "{:.1f}%".format(
                   100 * p.get_height() / total
               )  # percentage of each class of the category

Observations on No_of_Adults¶

In [18]:
labeled_barplot(data, "no_of_adults", perc=True)
No description has been provided for this image
In [19]:
data['no_of_adults'].unique()
Out[19]:
array([2, 1, 3, 0, 4])
In [20]:
def get_bookings_by_adults(data, num_adults):
  """
  Gets the number of bookings for a given number of adults.

  Args:
    data: The pandas DataFrame containing booking data.
    num_adults: The number of adults to filter by.

  Returns:
    The number of bookings with the specified number of adults.
  """
  bookings_for_adults = data[data['no_of_adults'] == num_adults]
  num_bookings = bookings_for_adults.shape[0]
  return num_bookings

# Get bookings for different adult counts
adult_counts = [0, 1, 2, 3, 4]
for num_adults in adult_counts:
  num_bookings = get_bookings_by_adults(data, num_adults)
  print(f"Number of bookings for {num_adults} adult(s): {num_bookings}")
Number of bookings for 0 adult(s): 139
Number of bookings for 1 adult(s): 7695
Number of bookings for 2 adult(s): 26108
Number of bookings for 3 adult(s): 2317
Number of bookings for 4 adult(s): 16

Observations

  • 2 adults have the highest count of bookings(26108), followed by booking for 1 adult(7695), and booking for 3 adults(2317).
  • Most of the hotel stays are either for couples or two adults traveling together.

Observations on number of children¶

In [21]:
labeled_barplot(data, "no_of_children", perc=True)  ## Complete the code to create labeled_barplot for number of children
No description has been provided for this image
In [22]:
# replacing 9, and 10 children with 3
data["no_of_children"] = data["no_of_children"].replace([9, 10], 3)
In [23]:
def get_bookings_by_children(data, num_children):
  """
  Gets the number of bookings for a given number of children.

  Args:
    data: The pandas DataFrame containing booking data.
    num_children: The number of children to filter by.

  Returns:
    The number of bookings with the specified number of children.
  """
  bookings_for_children = data[data['no_of_children'] == num_children]
  num_bookings = bookings_for_children.shape[0]
  return num_bookings

# Get bookings for different children counts
children_counts = [0, 1, 2, 3, 4]
for num_children in children_counts:
  num_bookings = get_bookings_by_children(data, num_children)
  print(f"Number of bookings for {num_children} child(ren): {num_bookings}")
Number of bookings for 0 child(ren): 33577
Number of bookings for 1 child(ren): 1618
Number of bookings for 2 child(ren): 1058
Number of bookings for 3 child(ren): 22
Number of bookings for 4 child(ren): 0

Observations

The bar plot shows majority of bookings without children (33577)

Observations on number of weekend nights¶

In [24]:
labeled_barplot(data,'no_of_weekend_nights')
No description has been provided for this image
In [25]:
def get_bookings_by_weekend_nights(data, num_nights):
  """
  Gets the number of bookings for a given number of weekend nights.

  Args:
    data: The pandas DataFrame containing booking data.
    num_nights: The number of weekend nights to filter by.

  Returns:
    The number of bookings with the specified number of weekend nights.
  """
  bookings_for_nights = data[data['no_of_weekend_nights'] == num_nights]
  num_bookings = bookings_for_nights.shape[0]
  return num_bookings

# Get bookings for different weekend night counts
weekend_night_counts = [0, 1, 2, 3, 4, 5, 6, 7]
for num_nights in weekend_night_counts:
  num_bookings = get_bookings_by_weekend_nights(data, num_nights)
  print(f"Number of bookings for {num_nights} weekend night(s): {num_bookings}")
Number of bookings for 0 weekend night(s): 16872
Number of bookings for 1 weekend night(s): 9995
Number of bookings for 2 weekend night(s): 9071
Number of bookings for 3 weekend night(s): 153
Number of bookings for 4 weekend night(s): 129
Number of bookings for 5 weekend night(s): 34
Number of bookings for 6 weekend night(s): 20
Number of bookings for 7 weekend night(s): 1

Observations

  • The largest category, in fact, is bookings without any weekend nights, meaning most stays are weekday stays or do not extend to the weekend.
  • There are one or two weekend nights suggesting that some visitors spend the night during at least part of the weekend, presumably during shorter pleasure trips.
  • Overall, the data suggests that most bookings are completely within the week or have only one or two nights during the weekend, while longer weekend stays are very rare.

Observations on number of week nights¶

In [26]:
labeled_barplot(data,'no_of_week_nights')
No description has been provided for this image
In [27]:
# Filter bookings with more than 5 weeknights
filtered_bookings = data[data['no_of_week_nights'] > 5]

# Count filtered bookings
num_filtered_bookings = filtered_bookings.shape[0]
In [28]:
# Calculate the percentage
percentage = (num_filtered_bookings / data.shape[0]) * 100

# Print the result
print(f"Percentage of bookings extending beyond 5 weeknights: {percentage:.2f}%")
Percentage of bookings extending beyond 5 weeknights: 1.41%

Observations

  • The highest counts are for 1, 2, and 3 weeknights in bookings. This tells us that most guests prefer to stay on weekdays and not for so many days.
  • The more the weeknights, the lower the count of bookings, meaning longer weekday stays are less likely.
  • Less than 2% of all bookings extend beyond 5 weeknights; stays over 10 weeknights are extremely rare.

Observations on type of meal plan¶

In [29]:
labeled_barplot(data,'type_of_meal_plan')
No description has been provided for this image

Observations

  • Meal Plan 1 dominates: Most bookings have chosen Meal Plan 1; this therefore means that most guests prefer that meal plan.
  • Meal Plan 2 has fewer bookings
  • Quite a remarkable number of bookings did not select any meal plan; this may mean some guests want flexibility in their stay or independently arrange for meals.

Observations on required car parking space¶

In [30]:
labeled_barplot(data,'required_car_parking_space')
No description has been provided for this image
In [31]:
total_parking_required = data['required_car_parking_space'].sum()
In [32]:
print(f"Total car parking spaces required: {total_parking_required}")
Total car parking spaces required: 1124

Observations

  • Most bookings did not request a car parking space. Either the guests do not need parking or could be using alternative means of transportation that did not require parking.
  • Out of the 36275 bookings, only 1124 ( about 3%) required parking.

Observations on room type reserved¶

In [33]:
labeled_barplot(data,'room_type_reserved')
No description has been provided for this image
In [34]:
def calculate_room_type_percentages(data):
  """
  Calculates the percentage of reservations for each room type.

  Args:
    data: The pandas DataFrame containing booking data.

  Returns:
    A pandas Series containing the percentage of reservations for each room type.
  """
  room_type_counts = data['room_type_reserved'].value_counts()
  total_bookings = data.shape[0]
  room_type_percentages = (room_type_counts / total_bookings) * 100
  return room_type_percentages

# Calculate and print the percentages
room_type_percentages = calculate_room_type_percentages(data)
print(room_type_percentages)
room_type_reserved
Room_Type 1   77.54652
Room_Type 4   16.69745
Room_Type 6    2.66299
Room_Type 2    1.90765
Room_Type 5    0.73053
Room_Type 7    0.43556
Room_Type 3    0.01930
Name: count, dtype: float64

Observations

  • Most of the bookings fall under Room Type 1(77%), which means that this room type is most popular among its guests.
  • Room Type 4 enjoys a fair number of bookings but less when compared with Room Type 1. It accounts for 16%.
  • Room Types 2, 3, 5, 6, and 7 have comparably low booking counts, hence it is likely that these rooms are unpopular with guests. This could be one of the factors accounting for the increasing number of cancellations.
  • Room Type 1 serves most guests' needs or offers a better value for money for guests. Hence, their preferred choice.

Observations on lead time¶

In [35]:
histogram_boxplot(data, "lead_time") # Code to plot histogram and boxplot
No description has been provided for this image

Observations

  • From the box plot, it can be observed that most lead times fall in a relatively small range-as indicated by the IQR.
  • The presence of outliers to the right depicts some bookings are incredibly long in lead times as far as over 300 days.
  • The histplot shows a right-skewed distribution. The highest count is concentrated within the shortest lead times close to 0, which shows that a lot of bookings were made shortly before arrival.
  • The pattern indicates that last-minute bookings are common while some guests book in advance.

Observations on arrival year¶

In [36]:
labeled_barplot(data,'arrival_year')
No description has been provided for this image
In [37]:
def calculate_arrival_year_percentages(data):
  """
  Calculates the percentage of bookings for each arrival year.

  Args:
    data: The pandas DataFrame containing booking data.

  Returns:
    A pandas Series containing the percentage of bookings for each arrival year.
  """
  arrival_year_counts = data['arrival_year'].value_counts()
  total_bookings = data.shape[0]
  arrival_year_percentages = (arrival_year_counts / total_bookings) * 100
  return arrival_year_percentages

# Calculate and print the percentages
arrival_year_percentages = calculate_arrival_year_percentages(data)
print(arrival_year_percentages)
arrival_year
2018   82.04273
2017   17.95727
Name: count, dtype: float64

Observations

More guests arrived in 2018 (82%) than in 2017 (18%)

Observations on arrival month¶

In [38]:
labeled_barplot(data,'arrival_month')
No description has been provided for this image
In [39]:
def calculate_arrival_month_percentages(data):
  """
  Calculates the percentage of bookings for each arrival month.

  Args:
    data: The pandas DataFrame containing booking data.

  Returns:
    A pandas Series containing the percentage of bookings for each arrival month.
  """
  arrival_month_counts = data['arrival_month'].value_counts()
  total_bookings = data.shape[0]
  arrival_month_percentages = (arrival_month_counts / total_bookings) * 100
  return arrival_month_percentages

# Calculate and print the percentages
arrival_month_percentages = calculate_arrival_month_percentages(data)
print(arrival_month_percentages)
arrival_month
10   14.65748
9    12.71123
8    10.51137
6     8.82977
12    8.32805
11    8.21502
7     8.04962
4     7.54238
5     7.16196
3     6.50034
2     4.69745
1     2.79531
Name: count, dtype: float64

Observations

  • The month of October has the highest number of arrivals (14% of the total). It could be identified as the peak season for the hotel.
  • August and September also have a high number of bookings.
  • Arrival counts are stable from March to July, showing only moderate booking levels, hence probably consistent but not peak demand. This could be due to the Spring and Summer seasons.
  • January and February, depict lower counts of booking. This could probably be due to the Winter season.
  • INN Hotels Group could build its marketing and resource allocation strategy around the different seasons of the year to improve cost management and profitability.

Observations on market segment type¶

In [40]:
labeled_barplot(data,'market_segment_type')
No description has been provided for this image
In [41]:
def calculate_market_segment_percentages(data):
  """
  Calculates the percentage of bookings for each market segment type.

  Args:
    data: The pandas DataFrame containing booking data.

  Returns:
    A pandas Series containing the percentage of bookings for each market segment type.
  """
  market_segment_counts = data['market_segment_type'].value_counts()
  total_bookings = data.shape[0]
  market_segment_percentages = (market_segment_counts / total_bookings) * 100
  return market_segment_percentages

# Calculate and print the percentages
market_segment_percentages = calculate_market_segment_percentages(data)
print(market_segment_percentages)
market_segment_type
Online          63.99449
Offline         29.02274
Corporate        5.56030
Complementary    1.07788
Aviation         0.34459
Name: count, dtype: float64

Observations

  • The market is segmented into Aviation, Complementary, Corporate, Offline, and Online.
  • Online accounts for the largest number of guests (64%), followed by offline (29%).
  • Corporate bookings make up a smaller segment of the market (6%).
  • Aviation and Complementary contribute minimally to the hotel's bookings.
  • This affords INN Hotels Group the opportunity to do a Customer Profitability Analysis, allowing it to determine which customer segments are profitable and which ones are not, and to know where sales effort should be concentrated.

Observations on repeated guest¶

In [42]:
labeled_barplot(data,'repeated_guest') # Code to determine repeated guest
No description has been provided for this image
In [43]:
repeated_guest_percentage = (data['repeated_guest'].sum() / len(data)) * 100

print(f"Percentage of repeated guests: {repeated_guest_percentage:.2f}%")
Percentage of repeated guests: 2.56%

Observations

  • 2.56% of the customers are repeated guests. 97.44% did not repeat. This is a cause for concern and must be investigated. Could it be that they did not have a great customer experience the first time?
  • There is also the possibility that 97.44% are first time customers.

Observations on repeated guest¶

In [44]:
labeled_barplot(data,'no_of_previous_cancellations') # Code to visualize number of previous cancellations
No description has been provided for this image
In [45]:
# Percentage of bookings with previous cancellations
bookings_with_previous_cancellations = data[data['no_of_previous_cancellations'] > 0]
percentage_with_previous_cancellations = (len(bookings_with_previous_cancellations) / len(data)) * 100
print(f"Percentage of bookings with previous cancellations: {percentage_with_previous_cancellations:.2f}%")

# Percentage of bookings with NO previous cancellations
bookings_with_no_previous_cancellations = data[data['no_of_previous_cancellations'] == 0]
percentage_with_no_previous_cancellations = (len(bookings_with_no_previous_cancellations) / len(data)) * 100
print(f"Percentage of bookings with NO previous cancellations: {percentage_with_no_previous_cancellations:.2f}%")
Percentage of bookings with previous cancellations: 0.93%
Percentage of bookings with NO previous cancellations: 99.07%

Observations

  • A large number of bookings are made by guests with zero previous cancellations (99.07%).
  • There are very few customers who repeatedly cancel their bookings (0.93%) INN Hotels Group should continue with whatever effective cancellation policy they have in place to discourage frequent cancellation.

Observations on number of previous booking not canceled¶

In [46]:
histogram_boxplot(data,'no_of_previous_bookings_not_canceled') # Code to create histogram and boxplot
No description has been provided for this image
In [47]:
# function to create labeled barplots


def labeled_barplot(data, feature, perc=False, n=None):
    """
    Barplot with percentage at the top

    data: dataframe
    feature: dataframe column
    perc: whether to display percentages instead of count (default is False)
    n: displays the top n category levels (default is None, i.e., display all levels)
    """

    total = len(data[feature])  # length of the column
    count = data[feature].nunique()
    if n is None:
        plt.figure(figsize=(count + 2, 6))
    else:
        plt.figure(figsize=(n + 2, 6))

    plt.xticks(rotation=90, fontsize=15)
    ax = sns.countplot(
        data=data,
        x=feature,
        order=data[feature].value_counts().index[:n],
    )

    for p in ax.patches:
        if perc == True:
            label = "{:.1f}%".format(
                100 * p.get_height() / total
            )  # percentage of each class of the category
        else:
            label = p.get_height()  # count of each level of the category

        x = p.get_x() + p.get_width() / 2  # width of the plot
        y = p.get_height()  # height of the plot

        ax.annotate(
            label,
            (x, y),
            ha="center",
            va="center",
            size=12,
            xytext=(0, 5),
            textcoords="offset points",
        )  # annotate the percentage

    plt.show()  # show the plot

Observations

  • Most of the values are around zero; only a few guests had a large number of prior non-canceled bookings

Observations on average price per room¶

In [48]:
histogram_boxplot(data,'avg_price_per_room')  ## Code to create histogram_boxplot for average price per room
No description has been provided for this image
In [49]:
data[data["avg_price_per_room"] == 0]
Out[49]:
no_of_adults no_of_children no_of_weekend_nights no_of_week_nights type_of_meal_plan required_car_parking_space room_type_reserved lead_time arrival_year arrival_month arrival_date market_segment_type repeated_guest no_of_previous_cancellations no_of_previous_bookings_not_canceled avg_price_per_room no_of_special_requests booking_status
63 1 0 0 1 Meal Plan 1 0 Room_Type 1 2 2017 9 10 Complementary 0 0 0 0.00000 1 Not_Canceled
145 1 0 0 2 Meal Plan 1 0 Room_Type 1 13 2018 6 1 Complementary 1 3 5 0.00000 1 Not_Canceled
209 1 0 0 0 Meal Plan 1 0 Room_Type 1 4 2018 2 27 Complementary 0 0 0 0.00000 1 Not_Canceled
266 1 0 0 2 Meal Plan 1 0 Room_Type 1 1 2017 8 12 Complementary 1 0 1 0.00000 1 Not_Canceled
267 1 0 2 1 Meal Plan 1 0 Room_Type 1 4 2017 8 23 Complementary 0 0 0 0.00000 1 Not_Canceled
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
35983 1 0 0 1 Meal Plan 1 0 Room_Type 7 0 2018 6 7 Complementary 1 4 17 0.00000 1 Not_Canceled
36080 1 0 1 1 Meal Plan 1 0 Room_Type 7 0 2018 3 21 Complementary 1 3 15 0.00000 1 Not_Canceled
36114 1 0 0 1 Meal Plan 1 0 Room_Type 1 1 2018 3 2 Online 0 0 0 0.00000 0 Not_Canceled
36217 2 0 2 1 Meal Plan 1 0 Room_Type 2 3 2017 8 9 Online 0 0 0 0.00000 2 Not_Canceled
36250 1 0 0 2 Meal Plan 2 0 Room_Type 1 6 2017 12 10 Online 0 0 0 0.00000 0 Not_Canceled

545 rows × 18 columns

In [50]:
data.loc[data["avg_price_per_room"] == 0, "market_segment_type"].value_counts()
Out[50]:
count
market_segment_type
Complementary 354
Online 191

In [51]:
# Calculating the 25th quantile
Q1 = data["avg_price_per_room"].quantile(0.25) ## Code to calculate 25th quantile for average price per room

# Calculating the 75th quantile
Q3 = data["avg_price_per_room"].quantile(0.75)  ## Code to calculate 75th quantile for average price per room

# Calculating IQR
IQR = Q3 - Q1

# Calculating value of upper whisker
Upper_Whisker = Q3 + 1.5 * IQR
Upper_Whisker
Out[51]:
179.55
In [52]:
# assigning the outliers the value of upper whisker
data.loc[data["avg_price_per_room"] >= 500, "avg_price_per_room"] = Upper_Whisker
In [53]:
avg_price = data['avg_price_per_room'].mean()
print(f"The average price of bookings is: {avg_price:.2f} euros")
The average price of bookings is: 103.41 euros

Observations

  • Average price of booking is 103.41 euros
  • The boxplot provides a fairly wide interquartile range, which indicates some variability in the room prices. A number of outliers at the higher end may suggest that most prices fall within a certain range but that a few bookings have substantially higher average room prices-perhaps because some rooms are premium, or perhaps simply because of high demand.
  • The median reflects that most bookings are at prices that are not very high.
  • The histogram distribution is right-skewed and groups the room prices in the lower-middle range, around 50 to 150.
  • There are very few bookings at price points above 300 euros.

Observations on number of special requests¶

In [54]:
labeled_barplot(data,'no_of_special_requests')  ## Code to create labeled_barplot for number of special requests
No description has been provided for this image

Observations

  • Majority of guests did not make special requests. They did not need additional accommodation or services.
  • There were 11373 bookings with one special requests, followed by 4364 bookigs with 2 special requests, 675 bookings with 3 special requests, 78 bookigs with 4 special requests, and 8 bookings with 5 special requests.
  • The count of bookings decline as the number of special requests increase.

Observations on booking status¶

In [55]:
labeled_barplot(data,'booking_status')  ## Code to create labeled_barplot for booking status
No description has been provided for this image

Let's encode Canceled bookings to 1 and Not_Canceled as 0 for further analysis

In [56]:
data["booking_status"] = data["booking_status"].apply(
    lambda x: 1 if x == "Canceled" else 0
)

Observations

  • A greater part of the bookings(24,390) were not canceled, an indication that most guests conclude with their bookings.
  • 11,885 cancelled bookings is quite a high rate.
  • INN Hotels Group should implement flexible policies that will disuade guests from cancelling.

Bivariate Analysis¶

Correlation¶

In [57]:
cols_list = data.select_dtypes(include=np.number).columns.tolist()

plt.figure(figsize=(12, 7))
sns.heatmap(
    data[cols_list].corr(), annot=True, vmin=-1, vmax=1, fmt=".2f", cmap="Spectral"
)
plt.show()
No description has been provided for this image

Observations

  • The heatmap reveals some striking correlations: for repeat guests, more previous non-canceled bookings are seen with a low cancellation rate, whereas first-time guests cancel more. However, mostly the low values of most correlations may suggest many of these features being relatively independent, which could reflect rather diverse booking patterns across the guests.
  • Some relationship between no_of_previous_bookings_not_canceled and repeated_guest (0.54) can be established.
  • no_of_previous_bookings_not_canceled and no_of_previous_cancellations also show some slight relationship (0.47).
  • Some slight relationship can be established between avg_price_per_room and no_of_children (0.35)
  • The closer the correlation co-efficient is to 1, the greater the relationship. +1 indicates a perfect positive relationship. -1 indicates a perfect negative relationship. 0 indicates no relationship at all.

Creating functions that will help us with further analysis.

In [58]:
### function to plot distributions wrt target


def distribution_plot_wrt_target(data, predictor, target):

    fig, axs = plt.subplots(2, 2, figsize=(12, 10))

    target_uniq = data[target].unique()

    axs[0, 0].set_title("Distribution of target for target=" + str(target_uniq[0]))
    sns.histplot(
        data=data[data[target] == target_uniq[0]],
        x=predictor,
        kde=True,
        ax=axs[0, 0],
        color="teal",
        stat="density",
    )

    axs[0, 1].set_title("Distribution of target for target=" + str(target_uniq[1]))
    sns.histplot(
        data=data[data[target] == target_uniq[1]],
        x=predictor,
        kde=True,
        ax=axs[0, 1],
        color="orange",
        stat="density",
    )

    axs[1, 0].set_title("Boxplot w.r.t target")
    sns.boxplot(data=data, x=target, y=predictor, ax=axs[1, 0])

    axs[1, 1].set_title("Boxplot (without outliers) w.r.t target")
    sns.boxplot(
        data=data,
        x=target,
        y=predictor,
        ax=axs[1, 1],
        showfliers=False,
    )

    plt.tight_layout()
    plt.show()
In [59]:
# function to plot stacked bar chart


def stacked_barplot(data, predictor, target):
    """
    Print the category counts and plot a stacked bar chart

    data: dataframe
    predictor: independent variable
    target: target variable
    """
    count = data[predictor].nunique()
    sorter = data[target].value_counts().index[-1]
    tab1 = pd.crosstab(data[predictor], data[target], margins=True).sort_values(
        by=sorter, ascending=False
    )
    print(tab1)
    print("-" * 120)
    tab = pd.crosstab(data[predictor], data[target], normalize="index").sort_values(
        by=sorter, ascending=False
    )
    tab.plot(kind="bar", stacked=True, figsize=(count + 5, 6))
    plt.legend(
        loc="lower left", frameon=False,
    )
    plt.legend(loc="upper left", bbox_to_anchor=(1, 1))
    plt.show()

Analysis to see how booking status impacts no. of adults, no. of children, no. of weekend nights, no. of week nights, type of meal plan, required car parking space, room type reserved, lead time, arrival year, arrival month, market segment type, repeated guest, no. of previous cancellations, no. of previous bookings not cancelled, no of special requests.¶

Booking Status vs Number of Adults¶

In [60]:
stacked_barplot(
    data,"no_of_adults", "booking_status"
)  # creates stacked barplot of booking status with respect to number of adults
booking_status      0      1    All
no_of_adults                       
All             24390  11885  36275
2               16989   9119  26108
1                5839   1856   7695
3                1454    863   2317
0                  95     44    139
4                  13      3     16
------------------------------------------------------------------------------------------------------------------------
No description has been provided for this image

Observations

  • Not cancelled(0) is more than cancelled(1) for booking status.
  • Number of adults has only a minor impact on cancellation rates. Booking status is not dependent on number of adults.

Booking Status vs Number of Children¶

In [61]:
stacked_barplot(
    data,"no_of_children", "booking_status"
)  # creates stacked barplot of booking status with respect to number of children
booking_status      0      1    All
no_of_children                     
All             24390  11885  36275
0               22695  10882  33577
1                1078    540   1618
2                 601    457   1058
3                  16      6     22
------------------------------------------------------------------------------------------------------------------------
No description has been provided for this image

Observations

  • There is no relationship between number of children and booking status. It is clear from previous analysis (univariate analysis) that booking is done mostly by adults, unaccompanied by children.

Booking Status vs Number of Weekend Nights¶

In [62]:
stacked_barplot(
    data,"no_of_weekend_nights", "booking_status"
)  # Code to create stacked barplot of booking status with respect to number of weekend nights
booking_status            0      1    All
no_of_weekend_nights                     
All                   24390  11885  36275
0                     11779   5093  16872
1                      6563   3432   9995
2                      5914   3157   9071
4                        46     83    129
3                        79     74    153
5                         5     29     34
6                         4     16     20
7                         0      1      1
------------------------------------------------------------------------------------------------------------------------
No description has been provided for this image

Observations

  • There is a relationship between booking status and number of weekend nights.
  • Bookings with higher weekend nights (eg 7, 5, 6, 4) have a higher rate of cancellation.
  • Bookings with lower weekend nights have lower rates of cancellation.
  • There is uncertainty with longer weekend nights, the possibility of cancellation.

Booking Status vs Number of Week Nights¶

In [63]:
stacked_barplot(
    data,"no_of_week_nights", "booking_status"
)  # Code to create stacked barplot of booking status with respect to number of week nights
booking_status         0      1    All
no_of_week_nights                     
All                24390  11885  36275
2                   7447   3997  11444
3                   5265   2574   7839
1                   6916   2572   9488
4                   1847   1143   2990
0                   1708    679   2387
5                    982    632   1614
6                    101     88    189
10                     9     53     62
7                     61     52    113
8                     30     32     62
9                     13     21     34
11                     3     14     17
15                     2      8     10
12                     2      7      9
13                     0      5      5
14                     3      4      7
16                     0      2      2
17                     1      2      3
------------------------------------------------------------------------------------------------------------------------
No description has been provided for this image

Observations

  • There is a relationship between booking status and number of week nights.
  • Bookings with higher week nights (eg 16, 13, 10, 11, 12, 17, 6, 14) have a higher rate of cancellation.
  • Bookings with lower week nights have lower rates of cancellation.
  • There is uncertainty with longer week nights, the possibility of cancellation. Shorter week nights are likely not to be cancelled.

Booking Status vs type of meal plan¶

In [64]:
stacked_barplot(
    data,"type_of_meal_plan", "booking_status"
)  # Code to create stacked barplot of booking status with respect to type of meal plan
booking_status         0      1    All
type_of_meal_plan                     
All                24390  11885  36275
Meal Plan 1        19156   8679  27835
Not Selected        3431   1699   5130
Meal Plan 2         1799   1506   3305
Meal Plan 3            4      1      5
------------------------------------------------------------------------------------------------------------------------
No description has been provided for this image

Observations

  • Some relationship can be observed between the type of meal plan chosen and the likelihood of cancellation.
  • Meal plan 2 has the highest cancellation rate, followed by Meal plan 1.

Booking Status vs Required car parking space¶

In [65]:
stacked_barplot(
    data,"required_car_parking_space", "booking_status"
)  # Code to create stacked barplot of booking status with respect to required car parking space
booking_status                  0      1    All
required_car_parking_space                     
All                         24390  11885  36275
0                           23380  11771  35151
1                            1010    114   1124
------------------------------------------------------------------------------------------------------------------------
No description has been provided for this image

Observations

  • Some relationship can be established between booking status and reserved car park space.
  • Guests who request for car parking space are likely not to cancel

Booking Status vs Room Type Reserved¶

In [66]:
stacked_barplot(
    data,"room_type_reserved", "booking_status"
)  # Code to create stacked barplot of booking status with respect to room type reserved
booking_status          0      1    All
room_type_reserved                     
All                 24390  11885  36275
Room_Type 1         19058   9072  28130
Room_Type 4          3988   2069   6057
Room_Type 6           560    406    966
Room_Type 2           464    228    692
Room_Type 5           193     72    265
Room_Type 7           122     36    158
Room_Type 3             5      2      7
------------------------------------------------------------------------------------------------------------------------
No description has been provided for this image

Observation

The relationship between booking status and the possibility of cancellation is not strong.

Booking Status vs Arrival year¶

In [67]:
stacked_barplot(
    data,"arrival_year", "booking_status"
)  # Code to create stacked barplot of booking status with respect to arrival year
booking_status      0      1    All
arrival_year                       
All             24390  11885  36275
2018            18837  10924  29761
2017             5553    961   6514
------------------------------------------------------------------------------------------------------------------------
No description has been provided for this image

Observations

  • There were more cancellations in 2018 than in 2017. That indicates some relationship between booking status and arrival year. The higher cancellations in 2018 could be due to several factors such as changes in economic conditions, changes in taste and behavior etc.

Booking Status vs Arrival month¶

In [68]:
stacked_barplot(
    data,"arrival_month", "booking_status"
)  # Code to create stacked barplot of booking status with respect to arrival month
booking_status      0      1    All
arrival_month                      
All             24390  11885  36275
10               3437   1880   5317
9                3073   1538   4611
8                2325   1488   3813
7                1606   1314   2920
6                1912   1291   3203
4                1741    995   2736
5                1650    948   2598
11               2105    875   2980
3                1658    700   2358
2                1274    430   1704
12               2619    402   3021
1                 990     24   1014
------------------------------------------------------------------------------------------------------------------------
No description has been provided for this image

Observations

  • We can establish a relationship between arrival month and the likelihood of cancellation.

  • Cancellation rates are higher in the summer ( eg June, July, and August) and lower in the winter (eg Dec, Jan, February). In between, cancellations are fairly stable.

Booking Status vs Market Segment Type¶

In [69]:
stacked_barplot(
    data,"market_segment_type", "booking_status"
)  # Code to create stacked barplot of booking status with respect to market segment type
booking_status           0      1    All
market_segment_type                     
All                  24390  11885  36275
Online               14739   8475  23214
Offline               7375   3153  10528
Corporate             1797    220   2017
Aviation                88     37    125
Complementary          391      0    391
------------------------------------------------------------------------------------------------------------------------
No description has been provided for this image

Observation

  • Online segment has a highest cancellation rate. This is followed by offline. It is interesting to note that's where INN Hotels Group gets most of its business from.
  • There's high probability that more cancellations would come from online and offline, followed by aviation.
  • Some relationship can be established between booking status and market segment type.

Booking Status vs Repeated Guest¶

In [70]:
stacked_barplot(
    data,"repeated_guest", "booking_status"
)  # Code to create stacked barplot of booking status with respect to repeated guest
booking_status      0      1    All
repeated_guest                     
All             24390  11885  36275
0               23476  11869  35345
1                 914     16    930
------------------------------------------------------------------------------------------------------------------------
No description has been provided for this image

Observations

  • Repeated guests have lower cancellation rates than non-repeated guests.
  • There is a high correlation between repeated guests and non- cancellation. Customer loyalty programs should be introduced to continue to enhance this.

Booking Status vs Number of Previous Cancellations¶

In [71]:
stacked_barplot(
    data,"no_of_previous_cancellations", "booking_status"
)  # Code to create stacked barplot of booking status with respect to no. of previous cancellations
booking_status                    0      1    All
no_of_previous_cancellations                     
All                           24390  11885  36275
0                             24068  11869  35937
1                               187     11    198
13                                0      4      4
3                                42      1     43
2                                46      0     46
4                                10      0     10
5                                11      0     11
6                                 1      0      1
11                               25      0     25
------------------------------------------------------------------------------------------------------------------------
No description has been provided for this image

Observations

  • We can establish some relationship between the number of previous cancellations and the current booking. Guests with high prior cancellations are much more likely to cancel, while those guests who have fewer or moderate past cancellations have much lower risks.
  • We can gain insight into areas where the hotel could identify the guests that are at the higher risk of cancellation based on their history of booking.

Booking Status vs Number of Previous bookings not cancelled¶

In [72]:
stacked_barplot(
    data,"no_of_previous_bookings_not_canceled", "booking_status"
)  # Code to create stacked barplot of booking status with respect to no. of previous bookings not cancelled
booking_status                            0      1    All
no_of_previous_bookings_not_canceled                     
All                                   24390  11885  36275
0                                     23585  11878  35463
1                                       224      4    228
12                                       11      1     12
4                                        64      1     65
6                                        35      1     36
2                                       112      0    112
44                                        2      0      2
43                                        1      0      1
42                                        1      0      1
41                                        1      0      1
40                                        1      0      1
38                                        1      0      1
39                                        1      0      1
46                                        1      0      1
37                                        1      0      1
36                                        1      0      1
35                                        1      0      1
45                                        1      0      1
48                                        2      0      2
47                                        1      0      1
33                                        1      0      1
49                                        1      0      1
50                                        1      0      1
51                                        1      0      1
52                                        1      0      1
53                                        1      0      1
54                                        1      0      1
55                                        1      0      1
56                                        1      0      1
57                                        1      0      1
58                                        1      0      1
34                                        1      0      1
31                                        2      0      2
32                                        2      0      2
3                                        80      0     80
5                                        60      0     60
7                                        24      0     24
8                                        23      0     23
9                                        19      0     19
10                                       19      0     19
11                                       15      0     15
13                                        7      0      7
14                                        9      0      9
15                                        8      0      8
16                                        7      0      7
17                                        6      0      6
18                                        6      0      6
19                                        6      0      6
20                                        6      0      6
21                                        6      0      6
22                                        6      0      6
23                                        3      0      3
24                                        3      0      3
25                                        3      0      3
26                                        2      0      2
27                                        3      0      3
28                                        2      0      2
29                                        2      0      2
30                                        2      0      2
------------------------------------------------------------------------------------------------------------------------
No description has been provided for this image
In [73]:
data[data["no_of_previous_bookings_not_canceled"] != 0][
    "booking_status"
].value_counts()
Out[73]:
count
booking_status
0 805
1 7

In [74]:
# Filter for non-canceled bookings
non_canceled_bookings = data[data['booking_status'] == 0]

# Count non-canceled bookings
num_non_canceled = non_canceled_bookings.shape[0]

# Calculate percentage
percentage_non_canceled = (num_non_canceled / data.shape[0]) * 100

# Print the result
print(f"Percentage of bookings not canceled: {percentage_non_canceled:.2f}%")
Percentage of bookings not canceled: 67.24%

Observations

  • A strong relationship between the number of previously non-canceled bookings and the likelihood of cancellation can be established.

Booking Status vs No. of Special requests¶

In [75]:
stacked_barplot(
    data,"no_of_special_requests", "booking_status"
)  # Code to create stacked barplot of booking status with respect to no. of special requests
booking_status              0      1    All
no_of_special_requests                     
All                     24390  11885  36275
0                       11232   8545  19777
1                        8670   2703  11373
2                        3727    637   4364
3                         675      0    675
4                          78      0     78
5                           8      0      8
------------------------------------------------------------------------------------------------------------------------
No description has been provided for this image

Observations

  • We can establish a relationship between the booking status and the number of special requests.
  • As the number of special requests increase, the number of booking cancellations decrease.

Question 1: What are the busiest months in the hotel?¶

In [76]:
# grouping the data on arrival months and extracting the count of bookings
monthly_data = data.groupby(["arrival_month"])["booking_status"].count()

# creating a dataframe with months and count of customers in each month
monthly_data = pd.DataFrame(
    {"Month": list(monthly_data.index), "Guests": list(monthly_data.values)}
)

# plotting the trend over different months
plt.figure(figsize=(10, 5))
sns.lineplot(data=monthly_data, x="Month", y="Guests")
plt.show()
No description has been provided for this image
In [77]:
def calculate_arrival_month_percentages(data):
  """
  Calculates the percentage of bookings for each arrival month.

  Args:
    data: The pandas DataFrame containing booking data.

  Returns:
    A pandas Series containing the percentage of bookings for each arrival month.
  """
  arrival_month_counts = data['arrival_month'].value_counts()
  total_bookings = data.shape[0]
  arrival_month_percentages = (arrival_month_counts / total_bookings) * 100
  return arrival_month_percentages

# Calculate and print the percentages
arrival_month_percentages = calculate_arrival_month_percentages(data)
print(arrival_month_percentages)
arrival_month
10   14.65748
9    12.71123
8    10.51137
6     8.82977
12    8.32805
11    8.21502
7     8.04962
4     7.54238
5     7.16196
3     6.50034
2     4.69745
1     2.79531
Name: count, dtype: float64

Observations

  • October, September, and August are the busiest months in the hotel, October being the highest.

Question 2: Which market segment do most of the guests come from?¶

In [78]:
labeled_barplot(data,'market_segment_type')
def calculate_market_segment_percentages(data):
  """
  Calculates the percentage of bookings for each market segment type.

  Args:
    data: The pandas DataFrame containing booking data.

  Returns:
    A pandas Series containing the percentage of bookings for each market segment type.
  """
  market_segment_counts = data['market_segment_type'].value_counts()
  total_bookings = data.shape[0]
  market_segment_percentages = (market_segment_counts / total_bookings) * 100
  return market_segment_percentages

# Calculate and print the percentages
market_segment_percentages = calculate_market_segment_percentages(data)
print(market_segment_percentages)
No description has been provided for this image
market_segment_type
Online          63.99449
Offline         29.02274
Corporate        5.56030
Complementary    1.07788
Aviation         0.34459
Name: count, dtype: float64

Observations

  • Most guests come from the online segment, representing 64%
  • 29% come from offline

Question 3: Hotel rates are dynamic and change according to demand and customer demographics. What are the differences in room prices in different market segments?¶

In [79]:
plt.figure(figsize=(10, 6))
sns.boxplot(
    data=data, x="market_segment_type", y="avg_price_per_room"
)
plt.show()
No description has been provided for this image
In [80]:
# plot of data frame made of just the market_segment_type and avg_price_per_room

total = len(data["market_segment_type"])  # length of the column
count = data["market_segment_type"].nunique()  # counts amount of unique values for
plt.figure(figsize=(12, 7))  # sets figure size

plt.xticks(fontsize=15)  # rotates tick labels 90 degrees
#     create barplot
ax = sns.barplot(
    data=data[["market_segment_type", "avg_price_per_room"]],
    x="market_segment_type",
    y="avg_price_per_room",
    palette="deep",  # sets color for plot
    ci=None,
)
#     creates labels on top of bars that are either counts or perentages of the whole column depending perc value
for p in ax.patches:
    label = "€{:.2f}".format(p.get_height())  # count of each level of the category

    x = p.get_x() + p.get_width() / 2  # width of the plot
    y = p.get_height()  # height of the plot

    #         edits the labels to be the correct size and placement
    ax.annotate(
        label,
        (x, y),
        ha="center",
        va="center",
        size=15,
        xytext=(0, 5),
        textcoords="offset points",
    )  # annotate the percentage or the count

plt.savefig(
    "avg_room_price_per_market_segment.jpg", bbox_inches="tight"
)  # saves plot as JPEG
plt.show()  # show the plot
No description has been provided for this image

Observations

The differences in room prices in different market segments are:

  • Online €112.26
  • Offline €91.60
  • Corporate €82.91
  • Aviation €100.70
  • Complementary €3.14

Question 4: What percentage of bookings are cancelled?¶

In [81]:
booking_status_counts = data['booking_status'].value_counts()
In [82]:
canceled_percentage = (booking_status_counts[1] / len(data)) * 100
print(f"Percentage of bookings canceled: {canceled_percentage:.2f}%")
Percentage of bookings canceled: 32.76%

Observations

  • Percentage of booking cancelled is 32.76%

Question 5: Repeating guests are the guests who stay in the hotel often and are important to brand equity. What percentage of repeating guests cancel?¶

In [83]:
data.groupby("booking_status")["repeated_guest"].value_counts()
Out[83]:
count
booking_status repeated_guest
0 0 23476
1 914
1 0 11869
1 16

In [84]:
# Filter for repeating guests
repeating_guests = data[data['repeated_guest'] == 1]

# Calculate cancellations within this group
repeating_guest_cancellations = repeating_guests[repeating_guests['booking_status'] == 1].shape[0]

# Calculate the percentage
percentage_repeating_guest_cancellations = (repeating_guest_cancellations / repeating_guests.shape[0]) * 100

# Print the result
print(f"Percentage of repeating guests who cancel: {percentage_repeating_guest_cancellations:.2f}%")
Percentage of repeating guests who cancel: 1.72%

Observations

  • Percentage of repeating guests who cancel is 1.7%

Question 6: Many guests have special requirements when booking a hotel room. Do these requirements affect booking cancellation?¶

In [85]:
stacked_barplot(
    data,"no_of_special_requests", "booking_status"
)  # Code to create stacked barplot of booking status with respect to no. of special requests
booking_status              0      1    All
no_of_special_requests                     
All                     24390  11885  36275
0                       11232   8545  19777
1                        8670   2703  11373
2                        3727    637   4364
3                         675      0    675
4                          78      0     78
5                           8      0      8
------------------------------------------------------------------------------------------------------------------------
No description has been provided for this image

Observations

  • We can establish a relationship between the booking status and the number of special requests.
  • As the number of special requests increase, the number of booking cancellations decrease.
  • These special requirements affect booking cancellation.

Data Preprocessing¶

  • Missing value treatment (if needed)
  • Feature engineering (if needed)
  • Outlier detection and treatment (if needed)
  • Preparing data for modeling
  • Any other preprocessing steps (if needed)

Missing value treatement¶

In [86]:
data.isnull().sum() #Checking for missing values
Out[86]:
0
no_of_adults 0
no_of_children 0
no_of_weekend_nights 0
no_of_week_nights 0
type_of_meal_plan 0
required_car_parking_space 0
room_type_reserved 0
lead_time 0
arrival_year 0
arrival_month 0
arrival_date 0
market_segment_type 0
repeated_guest 0
no_of_previous_cancellations 0
no_of_previous_bookings_not_canceled 0
avg_price_per_room 0
no_of_special_requests 0
booking_status 0

Observations

No missing value in the dataset

Outlier Check¶

  • Let's check for outliers in the data.
In [87]:
# outlier detection using boxplot
numeric_columns = data.select_dtypes(include=np.number).columns.tolist()
# dropping booking_status
numeric_columns.remove("booking_status")

plt.figure(figsize=(15, 12))

for i, variable in enumerate(numeric_columns):
    plt.subplot(4, 4, i + 1)
    plt.boxplot(data[variable], whis=1.5)
    plt.tight_layout()
    plt.title(variable)

plt.show()
No description has been provided for this image

Observations

There are a couple of outliers. Since they are part of the dataset, they will not be treated.

Model Building¶

Building a Logistic Regression model¶

In [88]:
#Computing different functions to check performance
def model_performance_classification_statsmodels(model,predictors,target,threshold=0.5):
    #Checking which probabilities are greater than the threshold
    pred_temp = model.predict(predictors)>threshold
    #Rounding off the variables
    pred = np.round(pred_temp)

    #Metrics being used for model performance
    acc = accuracy_score(target,pred) #To compute the accuracy score
    recall = recall_score(target,pred) #To compute the recall score
    precision = precision_score(target,pred) #To compute the precision score
    f1 = f1_score(target,pred) #To compute the f1 score

    #Creating a dataframe for the metrics
    df_perf = pd.DataFrame({'Accuracy':acc,'Recall':recall,'Precision':precision,'F1':f1,},index=[0],)

    return df_perf
In [89]:
def model_performance_classification_statsmodels(model, predictors, target, threshold=0.5):
    # Convert predictors to a NumPy array with a suitable data type
    predictors = np.asarray(predictors, dtype=np.float64)
    #Checking which probabilities are greater than the threshold
    pred_temp = model.predict(predictors)>threshold
    #Rounding off the variables
    pred = np.round(pred_temp)
    #Calculating the performance metrics
    Accuracy = accuracy_score(target, pred)
    #print('Accuracy', Accuracy)
    Recall = recall_score(target, pred)
    #print('Recall', Recall)
    Precision = precision_score(target, pred)
    #print('Precision', Precision)
    F1_Score = f1_score(target, pred)
    #print('F1_Score', F1_Score)

    #Creating a dataframe to display the results
    df_perf = pd.DataFrame(
        {
            "Accuracy": [Accuracy],
            "Recall": [Recall],
            "Precision": [Precision],
            "F1_Score": [F1_Score],
        }
    )
    return df_perf
In [90]:
#Computing the confusion matrix
def confusion_matrix_statsmodels(model, predictors, target, threshold=0.5):
    """
    To plot the confusion_matrix with percentages

    model: classifier
    predictors: independent variables
    target: dependent variable
    threshold: threshold for classifying the observation as class 1
    """
    y_pred = model.predict(predictors) > threshold
    cm = confusion_matrix(target, y_pred)
    labels = np.asarray(
        [
            ["{0:0.0f}".format(item) + "\n{0:.2%}".format(item / cm.flatten().sum())]
            for item in cm.flatten()
        ]
    ).reshape(2, 2)

    plt.figure(figsize=(6, 4))
    sns.heatmap(cm, annot=labels, fmt="")
    plt.ylabel("True label")
    plt.xlabel("Predicted label")

Data Preparation for Modeling

In [91]:
#Independent and dependent variables defined
x = data.drop(['booking_status'],axis=1)
y = data['booking_status']

#Adding constant
X = sm.add_constant(x)
X = pd.get_dummies(X,drop_first=True)

#Create a train and test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=1)
In [92]:
print('Shape of the training set:',X_train.shape)
print('Shape of test set:', X_test.shape)
print('Percentage of Classes in Training Set:',y_train.value_counts(normalize=True))
print('Percentage of Classes in Test Set:',y_test.value_counts(normalize=True))
Shape of the training set: (25392, 28)
Shape of test set: (10883, 28)
Percentage of Classes in Training Set: booking_status
0   0.67064
1   0.32936
Name: proportion, dtype: float64
Percentage of Classes in Test Set: booking_status
0   0.67638
1   0.32362
Name: proportion, dtype: float64
In [93]:
# creating dummy variables
X = pd.get_dummies(
    X,
    columns=X.select_dtypes(include=["object", "category"]).columns.tolist(),
    drop_first=True,
)  ## Complete the code to create dummies for independent features

X.head()
Out[93]:
const no_of_adults no_of_children no_of_weekend_nights no_of_week_nights required_car_parking_space lead_time arrival_year arrival_month arrival_date repeated_guest no_of_previous_cancellations no_of_previous_bookings_not_canceled avg_price_per_room no_of_special_requests type_of_meal_plan_Meal Plan 2 type_of_meal_plan_Meal Plan 3 type_of_meal_plan_Not Selected room_type_reserved_Room_Type 2 room_type_reserved_Room_Type 3 room_type_reserved_Room_Type 4 room_type_reserved_Room_Type 5 room_type_reserved_Room_Type 6 room_type_reserved_Room_Type 7 market_segment_type_Complementary market_segment_type_Corporate market_segment_type_Offline market_segment_type_Online
0 1.00000 2 0 1 2 0 224 2017 10 2 0 0 0 65.00000 0 False False False False False False False False False False False True False
1 1.00000 2 0 2 3 0 5 2018 11 6 0 0 0 106.68000 1 False False True False False False False False False False False False True
2 1.00000 1 0 2 1 0 1 2018 2 28 0 0 0 60.00000 0 False False False False False False False False False False False False True
3 1.00000 2 0 0 2 0 211 2018 5 20 0 0 0 100.00000 0 False False False False False False False False False False False False True
4 1.00000 2 0 1 1 0 48 2018 4 11 0 0 0 94.50000 0 False False True False False False False False False False False False True
In [94]:
# Converting the input attributes into float type for modeling
X = X.astype(float)
X.head()
Out[94]:
const no_of_adults no_of_children no_of_weekend_nights no_of_week_nights required_car_parking_space lead_time arrival_year arrival_month arrival_date repeated_guest no_of_previous_cancellations no_of_previous_bookings_not_canceled avg_price_per_room no_of_special_requests type_of_meal_plan_Meal Plan 2 type_of_meal_plan_Meal Plan 3 type_of_meal_plan_Not Selected room_type_reserved_Room_Type 2 room_type_reserved_Room_Type 3 room_type_reserved_Room_Type 4 room_type_reserved_Room_Type 5 room_type_reserved_Room_Type 6 room_type_reserved_Room_Type 7 market_segment_type_Complementary market_segment_type_Corporate market_segment_type_Offline market_segment_type_Online
0 1.00000 2.00000 0.00000 1.00000 2.00000 0.00000 224.00000 2017.00000 10.00000 2.00000 0.00000 0.00000 0.00000 65.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 1.00000 0.00000
1 1.00000 2.00000 0.00000 2.00000 3.00000 0.00000 5.00000 2018.00000 11.00000 6.00000 0.00000 0.00000 0.00000 106.68000 1.00000 0.00000 0.00000 1.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 1.00000
2 1.00000 1.00000 0.00000 2.00000 1.00000 0.00000 1.00000 2018.00000 2.00000 28.00000 0.00000 0.00000 0.00000 60.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 1.00000
3 1.00000 2.00000 0.00000 0.00000 2.00000 0.00000 211.00000 2018.00000 5.00000 20.00000 0.00000 0.00000 0.00000 100.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 1.00000
4 1.00000 2.00000 0.00000 1.00000 1.00000 0.00000 48.00000 2018.00000 4.00000 11.00000 0.00000 0.00000 0.00000 94.50000 0.00000 0.00000 0.00000 1.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 1.00000

Building the Model

In [95]:
#Fitting the logistic regression model
logit = sm.Logit(y_train,X_train.astype(float))
lg = logit.fit(disp=False)
print(lg.summary())
                           Logit Regression Results                           
==============================================================================
Dep. Variable:         booking_status   No. Observations:                25392
Model:                          Logit   Df Residuals:                    25364
Method:                           MLE   Df Model:                           27
Date:                Thu, 07 Nov 2024   Pseudo R-squ.:                  0.3292
Time:                        17:28:26   Log-Likelihood:                -10794.
converged:                      False   LL-Null:                       -16091.
Covariance Type:            nonrobust   LLR p-value:                     0.000
========================================================================================================
                                           coef    std err          z      P>|z|      [0.025      0.975]
--------------------------------------------------------------------------------------------------------
const                                 -922.8266    120.832     -7.637      0.000   -1159.653    -686.000
no_of_adults                             0.1137      0.038      3.019      0.003       0.040       0.188
no_of_children                           0.1580      0.062      2.544      0.011       0.036       0.280
no_of_weekend_nights                     0.1067      0.020      5.395      0.000       0.068       0.145
no_of_week_nights                        0.0397      0.012      3.235      0.001       0.016       0.064
required_car_parking_space              -1.5943      0.138    -11.565      0.000      -1.865      -1.324
lead_time                                0.0157      0.000     58.863      0.000       0.015       0.016
arrival_year                             0.4561      0.060      7.617      0.000       0.339       0.573
arrival_month                           -0.0417      0.006     -6.441      0.000      -0.054      -0.029
arrival_date                             0.0005      0.002      0.259      0.796      -0.003       0.004
repeated_guest                          -2.3472      0.617     -3.806      0.000      -3.556      -1.139
no_of_previous_cancellations             0.2664      0.086      3.108      0.002       0.098       0.434
no_of_previous_bookings_not_canceled    -0.1727      0.153     -1.131      0.258      -0.472       0.127
avg_price_per_room                       0.0188      0.001     25.396      0.000       0.017       0.020
no_of_special_requests                  -1.4689      0.030    -48.782      0.000      -1.528      -1.410
type_of_meal_plan_Meal Plan 2            0.1756      0.067      2.636      0.008       0.045       0.306
type_of_meal_plan_Meal Plan 3           17.3584   3987.836      0.004      0.997   -7798.656    7833.373
type_of_meal_plan_Not Selected           0.2784      0.053      5.247      0.000       0.174       0.382
room_type_reserved_Room_Type 2          -0.3605      0.131     -2.748      0.006      -0.618      -0.103
room_type_reserved_Room_Type 3          -0.0012      1.310     -0.001      0.999      -2.568       2.566
room_type_reserved_Room_Type 4          -0.2823      0.053     -5.304      0.000      -0.387      -0.178
room_type_reserved_Room_Type 5          -0.7189      0.209     -3.438      0.001      -1.129      -0.309
room_type_reserved_Room_Type 6          -0.9501      0.151     -6.274      0.000      -1.247      -0.653
room_type_reserved_Room_Type 7          -1.4003      0.294     -4.770      0.000      -1.976      -0.825
market_segment_type_Complementary      -40.5975   5.65e+05  -7.19e-05      1.000   -1.11e+06    1.11e+06
market_segment_type_Corporate           -1.1924      0.266     -4.483      0.000      -1.714      -0.671
market_segment_type_Offline             -2.1946      0.255     -8.621      0.000      -2.694      -1.696
market_segment_type_Online              -0.3995      0.251     -1.590      0.112      -0.892       0.093
========================================================================================================

Observations

  • Lead time, parking car space, repeat guests, special requests, and the room price are the major predictors of cancellation.
  • The longer the lead time and higher the room price, the greater the likelihood of cancellation, while repeat guests, guests with special requests, and those requiring parking are more committed to their bookings.
  • However, there are some anomalies, such as large standard errors for some categories.
  • Variables like no_of_adults, no_of_children, no_of_weekend_nights, required_car_parking_space, lead_time, arrival_year, repeated_guest, no_of_previous_cancellations, avg_price_per_room, and no_of_special_requests have statistically significant coefficients, with p-values less than 0.05. This means those features are meaningful predictors of the booking status.
  • The pseudo R-squared value of 0.3292 suggests that about 32.92% of the variance in booking status being canceled or not canceled is explained by this model.
In [96]:
#Training Performance
print('Training Performance')
model_performance_classification_statsmodels(lg,X_train,y_train)
Training Performance
Out[96]:
Accuracy Recall Precision F1_Score
0 0.80600 0.63410 0.73971 0.68285

Checking Multicollinearity¶

  • In order to make statistical inferences from a logistic regression model, it is important to ensure that there is no multicollinearity present in the data.
In [97]:
# Use Varianc Inflation Factor(VIF) to fix the multicollienarity issue
def checking_vif(predictors):
    # Select only numeric columns
    numeric_predictors = predictors.select_dtypes(include=['number'])

    # Drop rows with any NaN values in numeric columns
    numeric_predictors = numeric_predictors.dropna()

    vif = pd.DataFrame()
    vif['Features'] = numeric_predictors.columns

    # Calculating VIF for each feature
    vif['VIF'] = [variance_inflation_factor(numeric_predictors.values, i) for i in range(len(numeric_predictors.columns))]

    return vif
In [98]:
checking_vif(X_train)
Out[98]:
Features VIF
0 const 34866123.37597
1 no_of_adults 1.21461
2 no_of_children 1.17572
3 no_of_weekend_nights 1.05287
4 no_of_week_nights 1.06931
5 required_car_parking_space 1.03415
6 lead_time 1.15845
7 arrival_year 1.26424
8 arrival_month 1.24097
9 arrival_date 1.00486
10 repeated_guest 1.56324
11 no_of_previous_cancellations 1.37579
12 no_of_previous_bookings_not_canceled 1.63420
13 avg_price_per_room 1.39658
14 no_of_special_requests 1.11675

Observation

  • VIF < 5: The variable has low multicollinearity and no significant issue
In [99]:
# Convert all columns in X to numeric if possible.
# Errors='coerce' will replace non-numeric values with NaN.
for col in X.select_dtypes(include=['object']).columns:
    try:
        # Explicitly convert to numeric, handle errors by setting to NaN
        X[col] = pd.to_numeric(X[col], errors='coerce')
    except (ValueError, TypeError):
        # If conversion fails, drop the column and print a warning
        print(f"Column '{col}' cannot be converted to numeric and will be dropped.")
        X = X.drop(columns=[col])

# Impute NaN values using a strategy appropriate for numeric data
# Consider using mean, median, or a more sophisticated imputation method.
# Here, we use the mean for demonstration.
for col in X.select_dtypes(include=np.number).columns:
    X[col] = X[col].fillna(X[col].mean())

# Check if y is a pandas Series and convert it to a numpy array
if isinstance(y, pd.Series):
    y = y.to_numpy()

# *** The fix: Convert all columns in X to numeric dtype ***
X = X.astype(float)

Remove P-values

In [100]:
#Initial list of columns
cols = X_train.columns.tolist()

#Setting an initial max p-value
max_p_value = 1

while len(cols) > 0:
    #Defining the train set
    x_train_aux = X_train[cols]

    # *** The fix: Ensure all columns in x_train_aux are numeric ***
    for col in x_train_aux.select_dtypes(include=['object']).columns:
        try:
            x_train_aux[col] = pd.to_numeric(x_train_aux[col], errors='coerce')
        except (ValueError, TypeError):
            print(f"Column '{col}' in x_train_aux cannot be converted to numeric and will be dropped.")
            x_train_aux = x_train_aux.drop(columns=[col])
            # If the column is dropped, also remove it from 'cols' to avoid further errors
            if col in cols:
                cols.remove(col)

    x_train_aux = x_train_aux.astype(float)


    #Fitting the model
    model = sm.Logit(y_train, x_train_aux).fit(disp=False)

    #Getting the p-values and the maximum p-value
    p_values = model.pvalues
    max_p_value = max(p_values)

    #Name of the variable with maximum p-value
    feature_with_p_max = p_values.idxmax()

    if max_p_value > 0.05:
        cols.remove(feature_with_p_max)
    else:
        break

selected_features = cols
print(selected_features)
['const', 'no_of_adults', 'no_of_children', 'no_of_weekend_nights', 'no_of_week_nights', 'required_car_parking_space', 'lead_time', 'arrival_year', 'arrival_month', 'repeated_guest', 'no_of_previous_cancellations', 'avg_price_per_room', 'no_of_special_requests', 'type_of_meal_plan_Meal Plan 2', 'type_of_meal_plan_Not Selected', 'room_type_reserved_Room_Type 2', 'room_type_reserved_Room_Type 4', 'room_type_reserved_Room_Type 5', 'room_type_reserved_Room_Type 6', 'room_type_reserved_Room_Type 7', 'market_segment_type_Corporate', 'market_segment_type_Offline']
In [101]:
X_train1 = X_train[selected_features]
X_test1 = X_test[selected_features]

New Logit Model

In [102]:
logit1 = sm.Logit(y_train, X_train1.astype(float))
lg1 = logit1.fit(disp=False)
print(lg1.summary())
                           Logit Regression Results                           
==============================================================================
Dep. Variable:         booking_status   No. Observations:                25392
Model:                          Logit   Df Residuals:                    25370
Method:                           MLE   Df Model:                           21
Date:                Thu, 07 Nov 2024   Pseudo R-squ.:                  0.3282
Time:                        17:28:29   Log-Likelihood:                -10810.
converged:                       True   LL-Null:                       -16091.
Covariance Type:            nonrobust   LLR p-value:                     0.000
==================================================================================================
                                     coef    std err          z      P>|z|      [0.025      0.975]
--------------------------------------------------------------------------------------------------
const                           -915.6391    120.471     -7.600      0.000   -1151.758    -679.520
no_of_adults                       0.1088      0.037      2.914      0.004       0.036       0.182
no_of_children                     0.1531      0.062      2.470      0.014       0.032       0.275
no_of_weekend_nights               0.1086      0.020      5.498      0.000       0.070       0.147
no_of_week_nights                  0.0417      0.012      3.399      0.001       0.018       0.066
required_car_parking_space        -1.5947      0.138    -11.564      0.000      -1.865      -1.324
lead_time                          0.0157      0.000     59.213      0.000       0.015       0.016
arrival_year                       0.4523      0.060      7.576      0.000       0.335       0.569
arrival_month                     -0.0425      0.006     -6.591      0.000      -0.055      -0.030
repeated_guest                    -2.7367      0.557     -4.916      0.000      -3.828      -1.646
no_of_previous_cancellations       0.2288      0.077      2.983      0.003       0.078       0.379
avg_price_per_room                 0.0192      0.001     26.336      0.000       0.018       0.021
no_of_special_requests            -1.4698      0.030    -48.884      0.000      -1.529      -1.411
type_of_meal_plan_Meal Plan 2      0.1642      0.067      2.469      0.014       0.034       0.295
type_of_meal_plan_Not Selected     0.2860      0.053      5.406      0.000       0.182       0.390
room_type_reserved_Room_Type 2    -0.3552      0.131     -2.709      0.007      -0.612      -0.098
room_type_reserved_Room_Type 4    -0.2828      0.053     -5.330      0.000      -0.387      -0.179
room_type_reserved_Room_Type 5    -0.7364      0.208     -3.535      0.000      -1.145      -0.328
room_type_reserved_Room_Type 6    -0.9682      0.151     -6.403      0.000      -1.265      -0.672
room_type_reserved_Room_Type 7    -1.4343      0.293     -4.892      0.000      -2.009      -0.860
market_segment_type_Corporate     -0.7913      0.103     -7.692      0.000      -0.993      -0.590
market_segment_type_Offline       -1.7854      0.052    -34.363      0.000      -1.887      -1.684
==================================================================================================
In [103]:
print('Training Performance')
model_performance_classification_statsmodels(lg1,X_train1,y_train)
Training Performance
Out[103]:
Accuracy Recall Precision F1_Score
0 0.80545 0.63267 0.73907 0.68174

Observations

F1 score changed slightly. This is not a significant change in the Logistic Regression.

Converting coefficients to odds¶

  • The coefficients of the logistic regression model are in terms of log(odd), to find the odds we have to take the exponential of the coefficients.
  • Therefore, odds = exp(b)
  • The percentage change in odds is given as odds = (exp(b) - 1) * 100
In [104]:
#Converting coefficients to odds
odds = np.exp(lg1.params.astype(np.float64)) # Convert lg1.params to float64

#Finding the percentage change
perc_change_odds = (np.exp(lg1.params.astype(np.float64)) - 1) * 100 # Convert lg1.params to float64

#Removing limit from number of columns to display
pd.set_option("display.max_columns", None)

#Adding the odds to a dataframe
pd.DataFrame({"Odds": odds, "Change_odd%": perc_change_odds}, index=X_train1.columns).T
Out[104]:
const no_of_adults no_of_children no_of_weekend_nights no_of_week_nights required_car_parking_space lead_time arrival_year arrival_month repeated_guest no_of_previous_cancellations avg_price_per_room no_of_special_requests type_of_meal_plan_Meal Plan 2 type_of_meal_plan_Not Selected room_type_reserved_Room_Type 2 room_type_reserved_Room_Type 4 room_type_reserved_Room_Type 5 room_type_reserved_Room_Type 6 room_type_reserved_Room_Type 7 market_segment_type_Corporate market_segment_type_Offline
Odds 0.00000 1.11491 1.16546 1.11470 1.04258 0.20296 1.01583 1.57195 0.95839 0.06478 1.25712 1.01937 0.22996 1.17846 1.33109 0.70104 0.75364 0.47885 0.37977 0.23827 0.45326 0.16773
Change_odd% -100.00000 11.49096 16.54593 11.46966 4.25841 -79.70395 1.58331 57.19508 -4.16120 -93.52180 25.71181 1.93684 -77.00374 17.84641 33.10947 -29.89588 -24.63551 -52.11548 -62.02290 -76.17294 -54.67373 -83.22724

Observations

  • Odds of cancelling booking: No of adults increases by 11.5%, No. of children increases by 16.5%, No. of weekend nights increases by 11.4%, No. of week nights increases by 4.3%, Required car parking space decreases by 79.7%, Lead time increases by 1.6%, Arrival year increases by 57.2%, Arrival month decreases by 4.2%, Repeated guest decreases by 93.5%, No. of previous cancellations increases by 25.7%, Average price per room increases by 1.9%, No. of special requests decrease by 77.0%, Type of meal plan increases by 17.8% and 33%, Room type reserved and market segment witness a decrease.

Model performance evaluation¶

In [105]:
# creating confusion matrix
# Convert X_train1 columns to numeric type before prediction
X_train1_numeric = X_train1.astype(float) #Convert all columns to float64

confusion_matrix_statsmodels(lg1, X_train1_numeric, y_train)
No description has been provided for this image
In [106]:
log_model_train_perf = model_performance_classification_statsmodels(lg1,X_train1,y_train)

ROC-AUC¶

  • ROC-AUC on training set
In [107]:
# Before prediction, ensure all columns are numeric and handle potential non-numeric values
X_train1 = X_train1.apply(pd.to_numeric, errors='coerce').fillna(0)

# Convert X_train1 columns to numeric type before prediction
X_train1_numeric = X_train1.astype(np.float64) # Use np.float64 for consistency

logit_roc_auc_train = roc_auc_score(y_train, lg1.predict(X_train1_numeric)) # Pass X_train1_numeric to predict
fpr, tpr, thresholds = roc_curve(y_train, lg1.predict(X_train1_numeric)) # Pass X_train1_numeric to predict
plt.figure(figsize=(7, 5))
plt.plot(fpr, tpr, label="Logistic Regression (area = %0.2f)" % logit_roc_auc_train)
plt.plot([0, 1], [0, 1], "r--")
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.01])
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.title("Receiver operating characteristic")
plt.legend(loc="lower right")
plt.show()
No description has been provided for this image

Observations

The model's performance is good.

Optimal threshold using AUC-ROC curve¶

In [108]:
# Before prediction, ensure all columns are numeric and handle potential non-numeric values
X_train1 = X_train1.apply(pd.to_numeric, errors='coerce').fillna(0)

# Convert X_train1 columns to numeric type before prediction
# Explicitly cast all columns to float64
X_train1 = X_train1.astype(np.float64)

# Optimal threshold as per AUC-ROC curve
# The optimal cut off would be where tpr is high and fpr is low
fpr, tpr, thresholds = roc_curve(y_train, lg1.predict(X_train1))

optimal_idx = np.argmax(tpr - fpr)
optimal_threshold_auc_roc = thresholds[optimal_idx]
print(optimal_threshold_auc_roc)
0.3700522558708252

Confusion matrix using 0.37 as threshold

In [109]:
confusion_matrix_statsmodels(lg1,X_train1,y_train,threshold=optimal_threshold_auc_roc)
No description has been provided for this image

Observations

Model captures more cancellations accurately (higher true positives) but has a higher rate of false positives, meaning it might slightly overpredict cancellations.

In [110]:
# checking model performance for this model
log_reg_model_train_perf_threshold_auc_roc = model_performance_classification_statsmodels(
    lg1, X_train1, y_train, threshold=optimal_threshold_auc_roc
)
print("Training performance:")
log_reg_model_train_perf_threshold_auc_roc
Training performance:
Out[110]:
Accuracy Recall Precision F1_Score
0 0.79265 0.73622 0.66808 0.70049

Observations

There has been a significant improvement in the F1 Score.

Using the Precision-Recall curve to see if we can find a better threshold¶

In [111]:
y_scores = lg1.predict(X_train1)
prec, rec, tre = precision_recall_curve(y_train, y_scores,)


def plot_prec_recall_vs_tresh(precisions, recalls, thresholds):
    plt.plot(thresholds, precisions[:-1], "b--", label="precision")
    plt.plot(thresholds, recalls[:-1], "g--", label="recall")
    plt.xlabel("Threshold")
    plt.legend(loc="upper left")
    plt.ylim([0, 1])


plt.figure(figsize=(10, 7))
plot_prec_recall_vs_tresh(prec, rec, tre)
plt.show()
No description has been provided for this image
In [112]:
# setting the threshold
optimal_threshold_curve = 0.42

Observations

We find a threshold at 0.42

Confusion matrix using threshold of 0.42

In [113]:
# setting the threshold
optimal_threshold_curve = 0.42

# Use optimal_threshold_curve as the threshold
confusion_matrix_statsmodels(lg1, X_train1, y_train, threshold=optimal_threshold_curve)
No description has been provided for this image
In [114]:
log_reg_model_train_perf_threshold_curve = model_performance_classification_statsmodels(
    lg1, X_train1, y_train, threshold=optimal_threshold_curve
)
print("Training performance:")
log_reg_model_train_perf_threshold_curve
Training performance:
Out[114]:
Accuracy Recall Precision F1_Score
0 0.80132 0.69939 0.69797 0.69868

Observations

There is a slight drop in the F1 score.

Checking the performance on the test set using the default threshold

In [115]:
# Assuming X_test1 is a pandas DataFrame or a NumPy array
X_test1 = X_test1.astype(np.float64)  # If it's a DataFrame
# or
X_test1 = X_test1.astype(float)  # If it's a NumPy array

# Now, call your confusion matrix function
confusion_matrix_statsmodels(lg1, X_test1, y_test)
No description has been provided for this image
In [116]:
#Metrics
log_reg_model_test_perf = model_performance_classification_statsmodels(lg1,X_test1,y_test) ## Complete the code to check performance on X_test1 and y_test
In [117]:
print("Test performance:")
log_reg_model_test_perf
Test performance:
Out[117]:
Accuracy Recall Precision F1_Score
0 0.80465 0.63089 0.72900 0.67641

Observations

  • The model is performing well in the test performance, with an accuracy of 80.47%, indicating that the model mostly correctly predicts the booking status. With a recall of 63.09%, the model is only moderately good at identifying cancellations. Precision is 72.90%, which also indicates that true positives are okay for the prediction of cancellations when the model predicts positive.

  • The F1 score of the model would be 67.64%, which indicates that, on average, the model produces a performance that is satisfactory but far from excellent on cancellations. Overall, the model is doing okay, though recall could be better for picking cancellations.

  • ROC curve on test set
In [118]:
logit_roc_auc_train = roc_auc_score(y_test, lg1.predict(X_test1))
fpr, tpr, thresholds = roc_curve(y_test, lg1.predict(X_test1))
plt.figure(figsize=(7, 5))
plt.plot(fpr, tpr, label="Logistic Regression (area = %0.2f)" % logit_roc_auc_train)
plt.plot([0, 1], [0, 1], "r--")
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.01])
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.title("Receiver operating characteristic")
plt.legend(loc="lower right")
plt.show()
No description has been provided for this image

Observations

  • With 0.86, the model is good at drawing a distinction between cancelled and non- cancelled booking.
  • There is a strong balance between true positive rate and false positive rate.
  • The model is reliable for predicting booking cancellations.

Using model with threshold=0.37

In [119]:
confusion_matrix_statsmodels(lg1,X_test1,y_test,threshold=optimal_threshold_auc_roc)
No description has been provided for this image
In [120]:
# checking model performance for this model
log_reg_model_test_perf_threshold_auc_roc = model_performance_classification_statsmodels(
    lg1, X_test1, y_test, threshold=optimal_threshold_auc_roc
)
print("Test performance:")
log_reg_model_test_perf_threshold_auc_roc
Test performance:
Out[120]:
Accuracy Recall Precision F1_Score
0 0.79555 0.73964 0.66573 0.70074

Observations

  • The model with 0.37 threshold is more sensitive to booking cancellations than the model with threshold of 0.50.

Using model with threshold=0.42

In [121]:
# Get predicted probabilities
y_pred_prob = lg1.predict(X_test1)

# Calculate precision and recall for different thresholds
precision, recall, thresholds = precision_recall_curve(y_test, y_pred_prob)

# Find the optimal threshold based on your criteria (e.g., F1-score)
# This is just an example, adapt it to your specific requirements
f1_scores = 2 * (precision * recall) / (precision + recall)
optimal_threshold_recall_precision = thresholds[np.argmax(f1_scores)]

# Now you can use the optimal threshold in your confusion matrix function call
confusion_matrix_statsmodels(lg1, X_test1, y_test, optimal_threshold_recall_precision)
No description has been provided for this image
In [122]:
#Metrics
log_reg_model_test_perf_threshold_curve = model_performance_classification_statsmodels(
    lg1, X_test1, y_test, threshold=optimal_threshold_recall_precision
)
print("Test performance:")
log_reg_model_test_perf_threshold_curve
Test performance:
Out[122]:
Accuracy Recall Precision F1_Score
0 0.79224 0.75923 0.65427 0.70285

Observations

  • The model is good for effectively capturing cancellations.

Final Model Summary¶

In [123]:
# Assuming 'model_performance_classification_statsmodels' is a defined function
# and 'X_train1', 'y_train' are your training data
log_reg_model_train_perf = model_performance_classification_statsmodels(lg1, X_train1, y_train)
In [124]:
#Model Comparison Training Set
models_train_comp_df = pd.concat(
    [
        log_model_train_perf.T,
        log_reg_model_train_perf_threshold_auc_roc.T,
        log_reg_model_train_perf_threshold_curve.T,
    ],
    axis=1,
)
models_train_comp_df.columns = [
    "Logistic Regression-default Threshold",
    "Logistic Regression-0.37 Threshold",
    "Logistic Regression-0.42 Threshold",
]

print("Training performance comparison:")
models_train_comp_df
Training performance comparison:
Out[124]:
Logistic Regression-default Threshold Logistic Regression-0.37 Threshold Logistic Regression-0.42 Threshold
Accuracy 0.80545 0.79265 0.80132
Recall 0.63267 0.73622 0.69939
Precision 0.73907 0.66808 0.69797
F1_Score 0.68174 0.70049 0.69868
In [125]:
#Model Comparison Test set
models_test_comp_df = pd.concat(
    [
        log_reg_model_test_perf.T,
        log_reg_model_test_perf_threshold_auc_roc.T,
        log_reg_model_test_perf_threshold_curve.T,
    ],
    axis=1,
)
models_test_comp_df.columns = [
    "Logistic Regression-default Threshold",
    "Logistic Regression-0.37 Threshold",
    "Logistic Regression-0.42 Threshold",
]

print("Test performance comparison:")
models_test_comp_df
Test performance comparison:
Out[125]:
Logistic Regression-default Threshold Logistic Regression-0.37 Threshold Logistic Regression-0.42 Threshold
Accuracy 0.80465 0.79555 0.79224
Recall 0.63089 0.73964 0.75923
Precision 0.72900 0.66573 0.65427
F1_Score 0.67641 0.70074 0.70285

Observations

  • There is no overfitting or underfitting in any of the models.
  • The models have similar F1 Scores.

Building a Decision Tree model¶

Data Preparation for modeling (Decision Tree)

  • We want to predict which bookings will be canceled.
  • Before we proceed to build a model, we'll have to encode categorical features.
  • We'll split the data into train and test to be able to evaluate the model that we build on the train data.
In [126]:
#Creating independent and dependent variables
X = data.drop(['booking_status'],axis=1)
Y = data['booking_status']

#Create dummy variables
X = pd.get_dummies(X, drop_first=True)

#Splitting for training and test data
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.30, random_state=1)
In [127]:
print("Shape of Training set : ", X_train.shape)
print("Shape of test set : ", X_test.shape)
print("Percentage of classes in training set:",y_train.value_counts(normalize=True))
print("Percentage of classes in test set:",y_test.value_counts(normalize=True))
Shape of Training set :  (25392, 27)
Shape of test set :  (10883, 27)
Percentage of classes in training set: booking_status
0   0.67064
1   0.32936
Name: proportion, dtype: float64
Percentage of classes in test set: booking_status
0   0.67638
1   0.32362
Name: proportion, dtype: float64

Functions to calculate different metrics and confusion matrix so that we don't have to use the same code repeatedly for each model.¶

  • The model_performance_classification_sklearn function will be used to check the model performance of models.
  • The confusion_matrix_sklearnfunction will be used to plot the confusion matrix.
In [128]:
# defining a function to compute different metrics to check performance of a classification model built using sklearn
def model_performance_classification_sklearn(model, predictors, target):
    """
    Function to compute different metrics to check classification model performance

    model: classifier
    predictors: independent variables
    target: dependent variable
    """

    # predicting using the independent variables
    pred = model.predict(predictors)

    acc = accuracy_score(target, pred)  # to compute Accuracy
    recall = recall_score(target, pred)  # to compute Recall
    precision = precision_score(target, pred)  # to compute Precision
    f1 = f1_score(target, pred)  # to compute F1-score

    # creating a dataframe of metrics
    df_perf = pd.DataFrame(
        {"Accuracy": acc, "Recall": recall, "Precision": precision, "F1": f1,},
        index=[0],
    )

    return df_perf
In [129]:
def confusion_matrix_sklearn(model, predictors, target):
    """
    To plot the confusion_matrix with percentages

    model: classifier
    predictors: independent variables
    target: dependent variable
    """
    y_pred = model.predict(predictors)
    cm = confusion_matrix(target, y_pred)
    labels = np.asarray(
        [
            ["{0:0.0f}".format(item) + "\n{0:.2%}".format(item / cm.flatten().sum())]
            for item in cm.flatten()
        ]
    ).reshape(2, 2)

    plt.figure(figsize=(6, 4))
    sns.heatmap(cm, annot=labels, fmt="")
    plt.ylabel("True label")
    plt.xlabel("Predicted label")

Building Decision Tree Model¶

In [130]:
model = DecisionTreeClassifier(random_state=1)
model.fit(X_train,y_train)
Out[130]:
DecisionTreeClassifier(random_state=1)
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
DecisionTreeClassifier(random_state=1)

Checking model performance on training set¶

In [131]:
confusion_matrix_sklearn(model,X_train,y_train)
No description has been provided for this image
In [132]:
decision_tree_perf_train_default = model_performance_classification_sklearn(
    model, X_train, y_train
)
decision_tree_perf_train_default
Out[132]:
Accuracy Recall Precision F1
0 0.99421 0.98661 0.99578 0.99117

Observations

  • The model has a strong ability to correctly identify class 0; True Negatives of 16994.
  • The model has good predictive performance.
  • The F1 Score is almost a perfect score- 99.1%

Checking model performance on test set¶

In [133]:
confusion_matrix_sklearn(model,X_test,y_test)
No description has been provided for this image
In [134]:
decision_tree_perf_test_default = model_performance_classification_sklearn(model,X_test,y_test) ## Complete the code to check performance on test set
decision_tree_perf_test_default
Out[134]:
Accuracy Recall Precision F1
0 0.87118 0.81175 0.79461 0.80309

Observations

  • The F1 Score is 80%, which is less than the score we got from the training set (99%).
  • There could be overfitting of the data.

Before pruning the tree let's check the important features.

In [135]:
feature_names = list(X_train.columns)
importances = model.feature_importances_
indices = np.argsort(importances)

plt.figure(figsize=(8, 8))
plt.title("Feature Importances")
plt.barh(range(len(indices)), importances[indices], color="violet", align="center")
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel("Relative Importance")
plt.show()
No description has been provided for this image

Observations

  • Lead time is the most important predictor of booking cancellation by guests

Do we need to prune the tree?¶

Pre-Pruning

In [136]:
# Choose the type of classifier.
estimator = DecisionTreeClassifier(random_state=1, class_weight="balanced")

# Grid of parameters to choose from
parameters = {
    "max_depth": np.arange(2, 7, 2),
    "max_leaf_nodes": [50, 75, 150, 250],
    "min_samples_split": [10, 30, 50, 70],
}

# Type of scoring used to compare parameter combinations
acc_scorer = make_scorer(f1_score)

# Run the grid search
grid_obj = GridSearchCV(estimator, parameters, scoring=acc_scorer, cv=5)
grid_obj = grid_obj.fit(X_train, y_train)

# Set the clf to the best combination of parameters
estimator = grid_obj.best_estimator_

# Fit the best algorithm to the data.
estimator.fit(X_train, y_train)
Out[136]:
DecisionTreeClassifier(class_weight='balanced', max_depth=6, max_leaf_nodes=50,
                       min_samples_split=10, random_state=1)
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
DecisionTreeClassifier(class_weight='balanced', max_depth=6, max_leaf_nodes=50,
                       min_samples_split=10, random_state=1)

Checking performance on training set¶

In [137]:
confusion_matrix_sklearn(estimator,X_train,y_train)
No description has been provided for this image
In [138]:
decision_tree_tune_perf_train = model_performance_classification_sklearn(estimator,X_train,y_train)
decision_tree_tune_perf_train
Out[138]:
Accuracy Recall Precision F1
0 0.83097 0.78608 0.72425 0.75390

Observations

  • The pre-pruning has reduced the F1 Score of the training set from 99% to 75%
  • The model works relatively well, and there is great room for further improvement to bring down the numbers of false positives and false negatives.

Checking performance on test set¶

In [138]:
 
In [139]:
decision_tree_tune_perf_test = model_performance_classification_sklearn(estimator,X_test,y_test)
decision_tree_tune_perf_test
Out[139]:
Accuracy Recall Precision F1
0 0.83497 0.78336 0.72758 0.75444

Observations

  • Pre-pruning has reduced the F1 score from 80% to 75%, an indication that overfitting has been reduced.

Visualizing the Decision Tree¶

In [140]:
plt.figure(figsize=(20, 10))
out = tree.plot_tree(
    estimator,
    feature_names=feature_names,
    filled=True,
    fontsize=9,
    node_ids=False,
    class_names=None,
)
# below code will add arrows to the decision tree split if they are missing
for o in out:
    arrow = o.arrow_patch
    if arrow is not None:
        arrow.set_edgecolor("black")
        arrow.set_linewidth(1)
plt.show()
No description has been provided for this image
In [141]:
# Text report showing the rules of a decision tree -
print(tree.export_text(estimator, feature_names=feature_names, show_weights=True))
|--- lead_time <= 151.50
|   |--- no_of_special_requests <= 0.50
|   |   |--- market_segment_type_Online <= 0.50
|   |   |   |--- lead_time <= 90.50
|   |   |   |   |--- no_of_weekend_nights <= 0.50
|   |   |   |   |   |--- avg_price_per_room <= 196.50
|   |   |   |   |   |   |--- weights: [1736.39, 133.59] class: 0
|   |   |   |   |   |--- avg_price_per_room >  196.50
|   |   |   |   |   |   |--- weights: [0.75, 24.29] class: 1
|   |   |   |   |--- no_of_weekend_nights >  0.50
|   |   |   |   |   |--- lead_time <= 68.50
|   |   |   |   |   |   |--- weights: [960.27, 223.16] class: 0
|   |   |   |   |   |--- lead_time >  68.50
|   |   |   |   |   |   |--- weights: [129.73, 160.92] class: 1
|   |   |   |--- lead_time >  90.50
|   |   |   |   |--- lead_time <= 117.50
|   |   |   |   |   |--- avg_price_per_room <= 93.58
|   |   |   |   |   |   |--- weights: [214.72, 227.72] class: 1
|   |   |   |   |   |--- avg_price_per_room >  93.58
|   |   |   |   |   |   |--- weights: [82.76, 285.41] class: 1
|   |   |   |   |--- lead_time >  117.50
|   |   |   |   |   |--- no_of_week_nights <= 1.50
|   |   |   |   |   |   |--- weights: [87.23, 81.98] class: 0
|   |   |   |   |   |--- no_of_week_nights >  1.50
|   |   |   |   |   |   |--- weights: [228.14, 48.58] class: 0
|   |   |--- market_segment_type_Online >  0.50
|   |   |   |--- lead_time <= 13.50
|   |   |   |   |--- avg_price_per_room <= 99.44
|   |   |   |   |   |--- arrival_month <= 1.50
|   |   |   |   |   |   |--- weights: [92.45, 0.00] class: 0
|   |   |   |   |   |--- arrival_month >  1.50
|   |   |   |   |   |   |--- weights: [363.83, 132.08] class: 0
|   |   |   |   |--- avg_price_per_room >  99.44
|   |   |   |   |   |--- lead_time <= 3.50
|   |   |   |   |   |   |--- weights: [219.94, 85.01] class: 0
|   |   |   |   |   |--- lead_time >  3.50
|   |   |   |   |   |   |--- weights: [132.71, 280.85] class: 1
|   |   |   |--- lead_time >  13.50
|   |   |   |   |--- required_car_parking_space <= 0.50
|   |   |   |   |   |--- avg_price_per_room <= 71.92
|   |   |   |   |   |   |--- weights: [158.80, 159.40] class: 1
|   |   |   |   |   |--- avg_price_per_room >  71.92
|   |   |   |   |   |   |--- weights: [850.67, 3543.28] class: 1
|   |   |   |   |--- required_car_parking_space >  0.50
|   |   |   |   |   |--- weights: [48.46, 1.52] class: 0
|   |--- no_of_special_requests >  0.50
|   |   |--- no_of_special_requests <= 1.50
|   |   |   |--- market_segment_type_Online <= 0.50
|   |   |   |   |--- lead_time <= 102.50
|   |   |   |   |   |--- type_of_meal_plan_Not Selected <= 0.50
|   |   |   |   |   |   |--- weights: [697.09, 9.11] class: 0
|   |   |   |   |   |--- type_of_meal_plan_Not Selected >  0.50
|   |   |   |   |   |   |--- weights: [15.66, 9.11] class: 0
|   |   |   |   |--- lead_time >  102.50
|   |   |   |   |   |--- no_of_week_nights <= 2.50
|   |   |   |   |   |   |--- weights: [32.06, 19.74] class: 0
|   |   |   |   |   |--- no_of_week_nights >  2.50
|   |   |   |   |   |   |--- weights: [44.73, 3.04] class: 0
|   |   |   |--- market_segment_type_Online >  0.50
|   |   |   |   |--- lead_time <= 8.50
|   |   |   |   |   |--- lead_time <= 4.50
|   |   |   |   |   |   |--- weights: [498.03, 44.03] class: 0
|   |   |   |   |   |--- lead_time >  4.50
|   |   |   |   |   |   |--- weights: [258.71, 63.76] class: 0
|   |   |   |   |--- lead_time >  8.50
|   |   |   |   |   |--- required_car_parking_space <= 0.50
|   |   |   |   |   |   |--- weights: [2512.51, 1451.32] class: 0
|   |   |   |   |   |--- required_car_parking_space >  0.50
|   |   |   |   |   |   |--- weights: [134.20, 1.52] class: 0
|   |   |--- no_of_special_requests >  1.50
|   |   |   |--- lead_time <= 90.50
|   |   |   |   |--- no_of_week_nights <= 3.50
|   |   |   |   |   |--- weights: [1585.04, 0.00] class: 0
|   |   |   |   |--- no_of_week_nights >  3.50
|   |   |   |   |   |--- no_of_special_requests <= 2.50
|   |   |   |   |   |   |--- weights: [180.42, 57.69] class: 0
|   |   |   |   |   |--- no_of_special_requests >  2.50
|   |   |   |   |   |   |--- weights: [52.19, 0.00] class: 0
|   |   |   |--- lead_time >  90.50
|   |   |   |   |--- no_of_special_requests <= 2.50
|   |   |   |   |   |--- arrival_month <= 8.50
|   |   |   |   |   |   |--- weights: [184.90, 56.17] class: 0
|   |   |   |   |   |--- arrival_month >  8.50
|   |   |   |   |   |   |--- weights: [106.61, 106.27] class: 0
|   |   |   |   |--- no_of_special_requests >  2.50
|   |   |   |   |   |--- weights: [67.10, 0.00] class: 0
|--- lead_time >  151.50
|   |--- avg_price_per_room <= 100.04
|   |   |--- no_of_special_requests <= 0.50
|   |   |   |--- no_of_adults <= 1.50
|   |   |   |   |--- market_segment_type_Online <= 0.50
|   |   |   |   |   |--- lead_time <= 163.50
|   |   |   |   |   |   |--- weights: [3.73, 24.29] class: 1
|   |   |   |   |   |--- lead_time >  163.50
|   |   |   |   |   |   |--- weights: [257.96, 62.24] class: 0
|   |   |   |   |--- market_segment_type_Online >  0.50
|   |   |   |   |   |--- avg_price_per_room <= 2.50
|   |   |   |   |   |   |--- weights: [8.95, 3.04] class: 0
|   |   |   |   |   |--- avg_price_per_room >  2.50
|   |   |   |   |   |   |--- weights: [0.75, 97.16] class: 1
|   |   |   |--- no_of_adults >  1.50
|   |   |   |   |--- avg_price_per_room <= 82.47
|   |   |   |   |   |--- market_segment_type_Offline <= 0.50
|   |   |   |   |   |   |--- weights: [2.98, 282.37] class: 1
|   |   |   |   |   |--- market_segment_type_Offline >  0.50
|   |   |   |   |   |   |--- weights: [213.97, 385.60] class: 1
|   |   |   |   |--- avg_price_per_room >  82.47
|   |   |   |   |   |--- no_of_adults <= 2.50
|   |   |   |   |   |   |--- weights: [23.86, 1030.80] class: 1
|   |   |   |   |   |--- no_of_adults >  2.50
|   |   |   |   |   |   |--- weights: [5.22, 0.00] class: 0
|   |   |--- no_of_special_requests >  0.50
|   |   |   |--- no_of_weekend_nights <= 0.50
|   |   |   |   |--- lead_time <= 180.50
|   |   |   |   |   |--- lead_time <= 159.50
|   |   |   |   |   |   |--- weights: [7.46, 7.59] class: 1
|   |   |   |   |   |--- lead_time >  159.50
|   |   |   |   |   |   |--- weights: [37.28, 4.55] class: 0
|   |   |   |   |--- lead_time >  180.50
|   |   |   |   |   |--- no_of_special_requests <= 2.50
|   |   |   |   |   |   |--- weights: [20.13, 212.54] class: 1
|   |   |   |   |   |--- no_of_special_requests >  2.50
|   |   |   |   |   |   |--- weights: [8.95, 0.00] class: 0
|   |   |   |--- no_of_weekend_nights >  0.50
|   |   |   |   |--- market_segment_type_Offline <= 0.50
|   |   |   |   |   |--- arrival_month <= 11.50
|   |   |   |   |   |   |--- weights: [231.12, 110.82] class: 0
|   |   |   |   |   |--- arrival_month >  11.50
|   |   |   |   |   |   |--- weights: [19.38, 34.92] class: 1
|   |   |   |   |--- market_segment_type_Offline >  0.50
|   |   |   |   |   |--- lead_time <= 348.50
|   |   |   |   |   |   |--- weights: [106.61, 3.04] class: 0
|   |   |   |   |   |--- lead_time >  348.50
|   |   |   |   |   |   |--- weights: [5.96, 4.55] class: 0
|   |--- avg_price_per_room >  100.04
|   |   |--- arrival_month <= 11.50
|   |   |   |--- no_of_special_requests <= 2.50
|   |   |   |   |--- weights: [0.00, 3200.19] class: 1
|   |   |   |--- no_of_special_requests >  2.50
|   |   |   |   |--- weights: [23.11, 0.00] class: 0
|   |   |--- arrival_month >  11.50
|   |   |   |--- no_of_special_requests <= 0.50
|   |   |   |   |--- weights: [35.04, 0.00] class: 0
|   |   |   |--- no_of_special_requests >  0.50
|   |   |   |   |--- arrival_date <= 24.50
|   |   |   |   |   |--- weights: [3.73, 0.00] class: 0
|   |   |   |   |--- arrival_date >  24.50
|   |   |   |   |   |--- weights: [3.73, 22.77] class: 1

In [142]:
# importance of features in the tree building

importances = estimator.feature_importances_
indices = np.argsort(importances)

plt.figure(figsize=(8, 8))
plt.title("Feature Importances")
plt.barh(range(len(indices)), importances[indices], color="violet", align="center")
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel("Relative Importance")
plt.show()
No description has been provided for this image

Observations

  • Before prepruning, lead time Lead time was the most important predictor of booking cancellation by guests, followed by average price per room. After pre-pruning, lead time is still the most important predictor of booking cancellation by guests. Market segment type online is now the second most important predictor.

Cost Complexity Pruning

In [143]:
clf = DecisionTreeClassifier(random_state=1, class_weight="balanced")
path = clf.cost_complexity_pruning_path(X_train, y_train)
ccp_alphas, impurities = abs(path.ccp_alphas), path.impurities
In [144]:
pd.DataFrame(path)
Out[144]:
ccp_alphas impurities
0 0.00000 0.00838
1 0.00000 0.00838
2 0.00000 0.00838
3 0.00000 0.00838
4 0.00000 0.00838
... ... ...
1839 0.00890 0.32806
1840 0.00980 0.33786
1841 0.01272 0.35058
1842 0.03412 0.41882
1843 0.08118 0.50000

1844 rows × 2 columns

In [145]:
fig, ax = plt.subplots(figsize=(10, 5))
ax.plot(ccp_alphas[:-1], impurities[:-1], marker="o", drawstyle="steps-post")
ax.set_xlabel("effective alpha")
ax.set_ylabel("total impurity of leaves")
ax.set_title("Total Impurity vs effective alpha for training set")
plt.show()
No description has been provided for this image

Next, we train a decision tree using effective alphas. The last value in ccp_alphas is the alpha value that prunes the whole tree, leaving the tree, clfs[-1], with one node.

In [146]:
clfs = []
for ccp_alpha in ccp_alphas:
    clf = DecisionTreeClassifier(
        random_state=1, ccp_alpha=ccp_alpha, class_weight="balanced"
    )
    clf.fit(X_train, y_train) ## Complete the code to fit decision tree on training data
    clfs.append(clf)
print(
    "Number of nodes in the last tree is: {} with ccp_alpha: {}".format(
        clfs[-1].tree_.node_count, ccp_alphas[-1]
    )
)
Number of nodes in the last tree is: 1 with ccp_alpha: 0.0811791438913696
In [147]:
clfs = clfs[:-1]
ccp_alphas = ccp_alphas[:-1]

node_counts = [clf.tree_.node_count for clf in clfs]
depth = [clf.tree_.max_depth for clf in clfs]
fig, ax = plt.subplots(2, 1, figsize=(10, 7))
ax[0].plot(ccp_alphas, node_counts, marker="o", drawstyle="steps-post")
ax[0].set_xlabel("alpha")
ax[0].set_ylabel("number of nodes")
ax[0].set_title("Number of nodes vs alpha")
ax[1].plot(ccp_alphas, depth, marker="o", drawstyle="steps-post")
ax[1].set_xlabel("alpha")
ax[1].set_ylabel("depth of tree")
ax[1].set_title("Depth vs alpha")
fig.tight_layout()
No description has been provided for this image

F1 Score vs alpha for training and testing sets¶

In [148]:
f1_train = []
for clf in clfs:
    pred_train = clf.predict(X_train)
    values_train = f1_score(y_train, pred_train)
    f1_train.append(values_train)

f1_test = []
for clf in clfs:
    pred_test = clf.predict(X_test)
    values_test = f1_score(y_test, pred_test)
    f1_test.append(values_test)
In [149]:
fig, ax = plt.subplots(figsize=(15, 5))
ax.set_xlabel("alpha")
ax.set_ylabel("F1 Score")
ax.set_title("F1 Score vs alpha for training and testing sets")
ax.plot(ccp_alphas, f1_train, marker="o", label="train", drawstyle="steps-post")
ax.plot(ccp_alphas, f1_test, marker="o", label="test", drawstyle="steps-post")
ax.legend()
plt.show()
No description has been provided for this image
In [150]:
index_best_model = np.argmax(f1_test)
best_model = clfs[index_best_model]
print(best_model)
DecisionTreeClassifier(ccp_alpha=0.00012267633155167043,
                       class_weight='balanced', random_state=1)

Checking performance on training set¶

In [151]:
confusion_matrix_sklearn(best_model, X_train, y_train)
No description has been provided for this image
In [152]:
decision_tree_post_perf_train = model_performance_classification_sklearn(
    best_model, X_train, y_train
)
decision_tree_post_perf_train
Out[152]:
Accuracy Recall Precision F1
0 0.89954 0.90303 0.81274 0.85551

Observations

  • There is a significant increase in the F1 Score after cost complexity pruning.
  • Overfitting has reduced
  • Improved precision and recall

Checking performance on test set¶

In [153]:
#Confusion Matrix
confusion_matrix_sklearn(best_model, X_test, y_test)
No description has been provided for this image
In [154]:
#Metrics
decision_tree_test =  model_performance_classification_sklearn(best_model,X_train,y_train)
decision_tree_test
Out[154]:
Accuracy Recall Precision F1
0 0.89954 0.90303 0.81274 0.85551
In [155]:
plt.figure(figsize=(20, 10))

out = tree.plot_tree(
    best_model,
    feature_names=feature_names,
    filled=True,
    fontsize=9,
    node_ids=False,
    class_names=None,
)
for o in out:
    arrow = o.arrow_patch
    if arrow is not None:
        arrow.set_edgecolor("black")
        arrow.set_linewidth(1)
plt.show()
No description has been provided for this image
In [156]:
# Text report showing the rules of a decision tree -

print(tree.export_text(best_model, feature_names=feature_names, show_weights=True))
|--- lead_time <= 151.50
|   |--- no_of_special_requests <= 0.50
|   |   |--- market_segment_type_Online <= 0.50
|   |   |   |--- lead_time <= 90.50
|   |   |   |   |--- no_of_weekend_nights <= 0.50
|   |   |   |   |   |--- avg_price_per_room <= 196.50
|   |   |   |   |   |   |--- market_segment_type_Offline <= 0.50
|   |   |   |   |   |   |   |--- lead_time <= 16.50
|   |   |   |   |   |   |   |   |--- avg_price_per_room <= 68.50
|   |   |   |   |   |   |   |   |   |--- weights: [207.26, 10.63] class: 0
|   |   |   |   |   |   |   |   |--- avg_price_per_room >  68.50
|   |   |   |   |   |   |   |   |   |--- arrival_date <= 29.50
|   |   |   |   |   |   |   |   |   |   |--- no_of_adults <= 1.50
|   |   |   |   |   |   |   |   |   |   |   |--- truncated branch of depth 2
|   |   |   |   |   |   |   |   |   |   |--- no_of_adults >  1.50
|   |   |   |   |   |   |   |   |   |   |   |--- truncated branch of depth 5
|   |   |   |   |   |   |   |   |   |--- arrival_date >  29.50
|   |   |   |   |   |   |   |   |   |   |--- weights: [2.24, 7.59] class: 1
|   |   |   |   |   |   |   |--- lead_time >  16.50
|   |   |   |   |   |   |   |   |--- avg_price_per_room <= 135.00
|   |   |   |   |   |   |   |   |   |--- arrival_month <= 11.50
|   |   |   |   |   |   |   |   |   |   |--- no_of_previous_bookings_not_canceled <= 0.50
|   |   |   |   |   |   |   |   |   |   |   |--- truncated branch of depth 4
|   |   |   |   |   |   |   |   |   |   |--- no_of_previous_bookings_not_canceled >  0.50
|   |   |   |   |   |   |   |   |   |   |   |--- weights: [11.18, 0.00] class: 0
|   |   |   |   |   |   |   |   |   |--- arrival_month >  11.50
|   |   |   |   |   |   |   |   |   |   |--- weights: [21.62, 0.00] class: 0
|   |   |   |   |   |   |   |   |--- avg_price_per_room >  135.00
|   |   |   |   |   |   |   |   |   |--- weights: [0.00, 12.14] class: 1
|   |   |   |   |   |   |--- market_segment_type_Offline >  0.50
|   |   |   |   |   |   |   |--- weights: [1199.59, 1.52] class: 0
|   |   |   |   |   |--- avg_price_per_room >  196.50
|   |   |   |   |   |   |--- weights: [0.75, 24.29] class: 1
|   |   |   |   |--- no_of_weekend_nights >  0.50
|   |   |   |   |   |--- lead_time <= 68.50
|   |   |   |   |   |   |--- arrival_month <= 9.50
|   |   |   |   |   |   |   |--- avg_price_per_room <= 63.29
|   |   |   |   |   |   |   |   |--- arrival_date <= 20.50
|   |   |   |   |   |   |   |   |   |--- type_of_meal_plan_Not Selected <= 0.50
|   |   |   |   |   |   |   |   |   |   |--- weights: [41.75, 0.00] class: 0
|   |   |   |   |   |   |   |   |   |--- type_of_meal_plan_Not Selected >  0.50
|   |   |   |   |   |   |   |   |   |   |--- weights: [0.75, 3.04] class: 1
|   |   |   |   |   |   |   |   |--- arrival_date >  20.50
|   |   |   |   |   |   |   |   |   |--- avg_price_per_room <= 59.75
|   |   |   |   |   |   |   |   |   |   |--- arrival_date <= 23.50
|   |   |   |   |   |   |   |   |   |   |   |--- weights: [1.49, 12.14] class: 1
|   |   |   |   |   |   |   |   |   |   |--- arrival_date >  23.50
|   |   |   |   |   |   |   |   |   |   |   |--- weights: [14.91, 1.52] class: 0
|   |   |   |   |   |   |   |   |   |--- avg_price_per_room >  59.75
|   |   |   |   |   |   |   |   |   |   |--- lead_time <= 44.00
|   |   |   |   |   |   |   |   |   |   |   |--- weights: [0.75, 59.21] class: 1
|   |   |   |   |   |   |   |   |   |   |--- lead_time >  44.00
|   |   |   |   |   |   |   |   |   |   |   |--- weights: [3.73, 0.00] class: 0
|   |   |   |   |   |   |   |--- avg_price_per_room >  63.29
|   |   |   |   |   |   |   |   |--- no_of_weekend_nights <= 3.50
|   |   |   |   |   |   |   |   |   |--- lead_time <= 59.50
|   |   |   |   |   |   |   |   |   |   |--- arrival_month <= 7.50
|   |   |   |   |   |   |   |   |   |   |   |--- truncated branch of depth 3
|   |   |   |   |   |   |   |   |   |   |--- arrival_month >  7.50
|   |   |   |   |   |   |   |   |   |   |   |--- truncated branch of depth 3
|   |   |   |   |   |   |   |   |   |--- lead_time >  59.50
|   |   |   |   |   |   |   |   |   |   |--- arrival_month <= 5.50
|   |   |   |   |   |   |   |   |   |   |   |--- truncated branch of depth 2
|   |   |   |   |   |   |   |   |   |   |--- arrival_month >  5.50
|   |   |   |   |   |   |   |   |   |   |   |--- weights: [20.13, 0.00] class: 0
|   |   |   |   |   |   |   |   |--- no_of_weekend_nights >  3.50
|   |   |   |   |   |   |   |   |   |--- weights: [0.75, 15.18] class: 1
|   |   |   |   |   |   |--- arrival_month >  9.50
|   |   |   |   |   |   |   |--- weights: [413.04, 27.33] class: 0
|   |   |   |   |   |--- lead_time >  68.50
|   |   |   |   |   |   |--- avg_price_per_room <= 99.98
|   |   |   |   |   |   |   |--- arrival_month <= 3.50
|   |   |   |   |   |   |   |   |--- avg_price_per_room <= 62.50
|   |   |   |   |   |   |   |   |   |--- weights: [15.66, 0.00] class: 0
|   |   |   |   |   |   |   |   |--- avg_price_per_room >  62.50
|   |   |   |   |   |   |   |   |   |--- avg_price_per_room <= 80.38
|   |   |   |   |   |   |   |   |   |   |--- weights: [8.20, 25.81] class: 1
|   |   |   |   |   |   |   |   |   |--- avg_price_per_room >  80.38
|   |   |   |   |   |   |   |   |   |   |--- weights: [3.73, 0.00] class: 0
|   |   |   |   |   |   |   |--- arrival_month >  3.50
|   |   |   |   |   |   |   |   |--- no_of_week_nights <= 2.50
|   |   |   |   |   |   |   |   |   |--- weights: [55.17, 3.04] class: 0
|   |   |   |   |   |   |   |   |--- no_of_week_nights >  2.50
|   |   |   |   |   |   |   |   |   |--- lead_time <= 73.50
|   |   |   |   |   |   |   |   |   |   |--- weights: [0.00, 4.55] class: 1
|   |   |   |   |   |   |   |   |   |--- lead_time >  73.50
|   |   |   |   |   |   |   |   |   |   |--- weights: [21.62, 4.55] class: 0
|   |   |   |   |   |   |--- avg_price_per_room >  99.98
|   |   |   |   |   |   |   |--- arrival_year <= 2017.50
|   |   |   |   |   |   |   |   |--- weights: [8.95, 0.00] class: 0
|   |   |   |   |   |   |   |--- arrival_year >  2017.50
|   |   |   |   |   |   |   |   |--- avg_price_per_room <= 132.43
|   |   |   |   |   |   |   |   |   |--- weights: [9.69, 122.97] class: 1
|   |   |   |   |   |   |   |   |--- avg_price_per_room >  132.43
|   |   |   |   |   |   |   |   |   |--- weights: [6.71, 0.00] class: 0
|   |   |   |--- lead_time >  90.50
|   |   |   |   |--- lead_time <= 117.50
|   |   |   |   |   |--- avg_price_per_room <= 93.58
|   |   |   |   |   |   |--- avg_price_per_room <= 75.07
|   |   |   |   |   |   |   |--- no_of_week_nights <= 2.50
|   |   |   |   |   |   |   |   |--- avg_price_per_room <= 58.75
|   |   |   |   |   |   |   |   |   |--- weights: [5.96, 0.00] class: 0
|   |   |   |   |   |   |   |   |--- avg_price_per_room >  58.75
|   |   |   |   |   |   |   |   |   |--- market_segment_type_Offline <= 0.50
|   |   |   |   |   |   |   |   |   |   |--- weights: [4.47, 0.00] class: 0
|   |   |   |   |   |   |   |   |   |--- market_segment_type_Offline >  0.50
|   |   |   |   |   |   |   |   |   |   |--- arrival_month <= 4.50
|   |   |   |   |   |   |   |   |   |   |   |--- weights: [2.24, 118.41] class: 1
|   |   |   |   |   |   |   |   |   |   |--- arrival_month >  4.50
|   |   |   |   |   |   |   |   |   |   |   |--- truncated branch of depth 4
|   |   |   |   |   |   |   |--- no_of_week_nights >  2.50
|   |   |   |   |   |   |   |   |--- arrival_date <= 11.50
|   |   |   |   |   |   |   |   |   |--- weights: [31.31, 0.00] class: 0
|   |   |   |   |   |   |   |   |--- arrival_date >  11.50
|   |   |   |   |   |   |   |   |   |--- no_of_weekend_nights <= 1.50
|   |   |   |   |   |   |   |   |   |   |--- weights: [23.11, 6.07] class: 0
|   |   |   |   |   |   |   |   |   |--- no_of_weekend_nights >  1.50
|   |   |   |   |   |   |   |   |   |   |--- weights: [5.96, 9.11] class: 1
|   |   |   |   |   |   |--- avg_price_per_room >  75.07
|   |   |   |   |   |   |   |--- arrival_month <= 3.50
|   |   |   |   |   |   |   |   |--- weights: [59.64, 3.04] class: 0
|   |   |   |   |   |   |   |--- arrival_month >  3.50
|   |   |   |   |   |   |   |   |--- arrival_month <= 4.50
|   |   |   |   |   |   |   |   |   |--- weights: [1.49, 16.70] class: 1
|   |   |   |   |   |   |   |   |--- arrival_month >  4.50
|   |   |   |   |   |   |   |   |   |--- no_of_adults <= 1.50
|   |   |   |   |   |   |   |   |   |   |--- avg_price_per_room <= 86.00
|   |   |   |   |   |   |   |   |   |   |   |--- weights: [2.24, 16.70] class: 1
|   |   |   |   |   |   |   |   |   |   |--- avg_price_per_room >  86.00
|   |   |   |   |   |   |   |   |   |   |   |--- weights: [8.95, 3.04] class: 0
|   |   |   |   |   |   |   |   |   |--- no_of_adults >  1.50
|   |   |   |   |   |   |   |   |   |   |--- arrival_date <= 22.50
|   |   |   |   |   |   |   |   |   |   |   |--- weights: [44.73, 4.55] class: 0
|   |   |   |   |   |   |   |   |   |   |--- arrival_date >  22.50
|   |   |   |   |   |   |   |   |   |   |   |--- truncated branch of depth 3
|   |   |   |   |   |--- avg_price_per_room >  93.58
|   |   |   |   |   |   |--- arrival_date <= 11.50
|   |   |   |   |   |   |   |--- no_of_week_nights <= 1.50
|   |   |   |   |   |   |   |   |--- weights: [16.40, 39.47] class: 1
|   |   |   |   |   |   |   |--- no_of_week_nights >  1.50
|   |   |   |   |   |   |   |   |--- weights: [20.13, 6.07] class: 0
|   |   |   |   |   |   |--- arrival_date >  11.50
|   |   |   |   |   |   |   |--- avg_price_per_room <= 102.09
|   |   |   |   |   |   |   |   |--- weights: [5.22, 144.22] class: 1
|   |   |   |   |   |   |   |--- avg_price_per_room >  102.09
|   |   |   |   |   |   |   |   |--- avg_price_per_room <= 109.50
|   |   |   |   |   |   |   |   |   |--- no_of_week_nights <= 1.50
|   |   |   |   |   |   |   |   |   |   |--- weights: [0.75, 16.70] class: 1
|   |   |   |   |   |   |   |   |   |--- no_of_week_nights >  1.50
|   |   |   |   |   |   |   |   |   |   |--- weights: [33.55, 0.00] class: 0
|   |   |   |   |   |   |   |   |--- avg_price_per_room >  109.50
|   |   |   |   |   |   |   |   |   |--- avg_price_per_room <= 124.25
|   |   |   |   |   |   |   |   |   |   |--- weights: [2.98, 75.91] class: 1
|   |   |   |   |   |   |   |   |   |--- avg_price_per_room >  124.25
|   |   |   |   |   |   |   |   |   |   |--- weights: [3.73, 3.04] class: 0
|   |   |   |   |--- lead_time >  117.50
|   |   |   |   |   |--- no_of_week_nights <= 1.50
|   |   |   |   |   |   |--- arrival_date <= 7.50
|   |   |   |   |   |   |   |--- weights: [38.02, 0.00] class: 0
|   |   |   |   |   |   |--- arrival_date >  7.50
|   |   |   |   |   |   |   |--- avg_price_per_room <= 93.58
|   |   |   |   |   |   |   |   |--- avg_price_per_room <= 65.38
|   |   |   |   |   |   |   |   |   |--- weights: [0.00, 4.55] class: 1
|   |   |   |   |   |   |   |   |--- avg_price_per_room >  65.38
|   |   |   |   |   |   |   |   |   |--- weights: [24.60, 3.04] class: 0
|   |   |   |   |   |   |   |--- avg_price_per_room >  93.58
|   |   |   |   |   |   |   |   |--- arrival_date <= 28.00
|   |   |   |   |   |   |   |   |   |--- weights: [14.91, 72.87] class: 1
|   |   |   |   |   |   |   |   |--- arrival_date >  28.00
|   |   |   |   |   |   |   |   |   |--- weights: [9.69, 1.52] class: 0
|   |   |   |   |   |--- no_of_week_nights >  1.50
|   |   |   |   |   |   |--- no_of_adults <= 1.50
|   |   |   |   |   |   |   |--- weights: [84.25, 0.00] class: 0
|   |   |   |   |   |   |--- no_of_adults >  1.50
|   |   |   |   |   |   |   |--- lead_time <= 125.50
|   |   |   |   |   |   |   |   |--- avg_price_per_room <= 90.85
|   |   |   |   |   |   |   |   |   |--- avg_price_per_room <= 87.50
|   |   |   |   |   |   |   |   |   |   |--- weights: [13.42, 13.66] class: 1
|   |   |   |   |   |   |   |   |   |--- avg_price_per_room >  87.50
|   |   |   |   |   |   |   |   |   |   |--- weights: [0.00, 15.18] class: 1
|   |   |   |   |   |   |   |   |--- avg_price_per_room >  90.85
|   |   |   |   |   |   |   |   |   |--- weights: [10.44, 0.00] class: 0
|   |   |   |   |   |   |   |--- lead_time >  125.50
|   |   |   |   |   |   |   |   |--- arrival_date <= 19.50
|   |   |   |   |   |   |   |   |   |--- weights: [58.15, 18.22] class: 0
|   |   |   |   |   |   |   |   |--- arrival_date >  19.50
|   |   |   |   |   |   |   |   |   |--- weights: [61.88, 1.52] class: 0
|   |   |--- market_segment_type_Online >  0.50
|   |   |   |--- lead_time <= 13.50
|   |   |   |   |--- avg_price_per_room <= 99.44
|   |   |   |   |   |--- arrival_month <= 1.50
|   |   |   |   |   |   |--- weights: [92.45, 0.00] class: 0
|   |   |   |   |   |--- arrival_month >  1.50
|   |   |   |   |   |   |--- arrival_month <= 8.50
|   |   |   |   |   |   |   |--- no_of_weekend_nights <= 1.50
|   |   |   |   |   |   |   |   |--- avg_price_per_room <= 70.05
|   |   |   |   |   |   |   |   |   |--- weights: [31.31, 0.00] class: 0
|   |   |   |   |   |   |   |   |--- avg_price_per_room >  70.05
|   |   |   |   |   |   |   |   |   |--- lead_time <= 5.50
|   |   |   |   |   |   |   |   |   |   |--- no_of_adults <= 1.50
|   |   |   |   |   |   |   |   |   |   |   |--- weights: [38.77, 1.52] class: 0
|   |   |   |   |   |   |   |   |   |   |--- no_of_adults >  1.50
|   |   |   |   |   |   |   |   |   |   |   |--- truncated branch of depth 2
|   |   |   |   |   |   |   |   |   |--- lead_time >  5.50
|   |   |   |   |   |   |   |   |   |   |--- arrival_date <= 3.50
|   |   |   |   |   |   |   |   |   |   |   |--- weights: [6.71, 0.00] class: 0
|   |   |   |   |   |   |   |   |   |   |--- arrival_date >  3.50
|   |   |   |   |   |   |   |   |   |   |   |--- weights: [34.30, 40.99] class: 1
|   |   |   |   |   |   |   |--- no_of_weekend_nights >  1.50
|   |   |   |   |   |   |   |   |--- no_of_adults <= 1.50
|   |   |   |   |   |   |   |   |   |--- weights: [0.00, 19.74] class: 1
|   |   |   |   |   |   |   |   |--- no_of_adults >  1.50
|   |   |   |   |   |   |   |   |   |--- lead_time <= 2.50
|   |   |   |   |   |   |   |   |   |   |--- avg_price_per_room <= 74.21
|   |   |   |   |   |   |   |   |   |   |   |--- weights: [0.75, 3.04] class: 1
|   |   |   |   |   |   |   |   |   |   |--- avg_price_per_room >  74.21
|   |   |   |   |   |   |   |   |   |   |   |--- weights: [9.69, 0.00] class: 0
|   |   |   |   |   |   |   |   |   |--- lead_time >  2.50
|   |   |   |   |   |   |   |   |   |   |--- weights: [4.47, 10.63] class: 1
|   |   |   |   |   |   |--- arrival_month >  8.50
|   |   |   |   |   |   |   |--- no_of_week_nights <= 3.50
|   |   |   |   |   |   |   |   |--- weights: [155.07, 6.07] class: 0
|   |   |   |   |   |   |   |--- no_of_week_nights >  3.50
|   |   |   |   |   |   |   |   |--- arrival_month <= 11.50
|   |   |   |   |   |   |   |   |   |--- weights: [3.73, 10.63] class: 1
|   |   |   |   |   |   |   |   |--- arrival_month >  11.50
|   |   |   |   |   |   |   |   |   |--- weights: [7.46, 0.00] class: 0
|   |   |   |   |--- avg_price_per_room >  99.44
|   |   |   |   |   |--- lead_time <= 3.50
|   |   |   |   |   |   |--- avg_price_per_room <= 202.67
|   |   |   |   |   |   |   |--- no_of_week_nights <= 4.50
|   |   |   |   |   |   |   |   |--- arrival_month <= 5.50
|   |   |   |   |   |   |   |   |   |--- weights: [63.37, 30.36] class: 0
|   |   |   |   |   |   |   |   |--- arrival_month >  5.50
|   |   |   |   |   |   |   |   |   |--- arrival_date <= 20.50
|   |   |   |   |   |   |   |   |   |   |--- weights: [115.56, 12.14] class: 0
|   |   |   |   |   |   |   |   |   |--- arrival_date >  20.50
|   |   |   |   |   |   |   |   |   |   |--- arrival_date <= 24.50
|   |   |   |   |   |   |   |   |   |   |   |--- truncated branch of depth 3
|   |   |   |   |   |   |   |   |   |   |--- arrival_date >  24.50
|   |   |   |   |   |   |   |   |   |   |   |--- weights: [28.33, 3.04] class: 0
|   |   |   |   |   |   |   |--- no_of_week_nights >  4.50
|   |   |   |   |   |   |   |   |--- weights: [0.00, 6.07] class: 1
|   |   |   |   |   |   |--- avg_price_per_room >  202.67
|   |   |   |   |   |   |   |--- weights: [0.75, 22.77] class: 1
|   |   |   |   |   |--- lead_time >  3.50
|   |   |   |   |   |   |--- arrival_month <= 8.50
|   |   |   |   |   |   |   |--- avg_price_per_room <= 119.25
|   |   |   |   |   |   |   |   |--- avg_price_per_room <= 118.50
|   |   |   |   |   |   |   |   |   |--- weights: [18.64, 59.21] class: 1
|   |   |   |   |   |   |   |   |--- avg_price_per_room >  118.50
|   |   |   |   |   |   |   |   |   |--- weights: [8.20, 1.52] class: 0
|   |   |   |   |   |   |   |--- avg_price_per_room >  119.25
|   |   |   |   |   |   |   |   |--- weights: [34.30, 171.55] class: 1
|   |   |   |   |   |   |--- arrival_month >  8.50
|   |   |   |   |   |   |   |--- arrival_year <= 2017.50
|   |   |   |   |   |   |   |   |--- weights: [26.09, 1.52] class: 0
|   |   |   |   |   |   |   |--- arrival_year >  2017.50
|   |   |   |   |   |   |   |   |--- arrival_month <= 11.50
|   |   |   |   |   |   |   |   |   |--- arrival_date <= 14.00
|   |   |   |   |   |   |   |   |   |   |--- weights: [9.69, 36.43] class: 1
|   |   |   |   |   |   |   |   |   |--- arrival_date >  14.00
|   |   |   |   |   |   |   |   |   |   |--- avg_price_per_room <= 208.67
|   |   |   |   |   |   |   |   |   |   |   |--- truncated branch of depth 2
|   |   |   |   |   |   |   |   |   |   |--- avg_price_per_room >  208.67
|   |   |   |   |   |   |   |   |   |   |   |--- weights: [0.00, 4.55] class: 1
|   |   |   |   |   |   |   |   |--- arrival_month >  11.50
|   |   |   |   |   |   |   |   |   |--- weights: [15.66, 0.00] class: 0
|   |   |   |--- lead_time >  13.50
|   |   |   |   |--- required_car_parking_space <= 0.50
|   |   |   |   |   |--- avg_price_per_room <= 71.92
|   |   |   |   |   |   |--- avg_price_per_room <= 59.43
|   |   |   |   |   |   |   |--- lead_time <= 84.50
|   |   |   |   |   |   |   |   |--- weights: [50.70, 7.59] class: 0
|   |   |   |   |   |   |   |--- lead_time >  84.50
|   |   |   |   |   |   |   |   |--- arrival_year <= 2017.50
|   |   |   |   |   |   |   |   |   |--- arrival_date <= 27.00
|   |   |   |   |   |   |   |   |   |   |--- lead_time <= 131.50
|   |   |   |   |   |   |   |   |   |   |   |--- weights: [0.75, 15.18] class: 1
|   |   |   |   |   |   |   |   |   |   |--- lead_time >  131.50
|   |   |   |   |   |   |   |   |   |   |   |--- weights: [2.24, 0.00] class: 0
|   |   |   |   |   |   |   |   |   |--- arrival_date >  27.00
|   |   |   |   |   |   |   |   |   |   |--- weights: [3.73, 0.00] class: 0
|   |   |   |   |   |   |   |   |--- arrival_year >  2017.50
|   |   |   |   |   |   |   |   |   |--- weights: [10.44, 0.00] class: 0
|   |   |   |   |   |   |--- avg_price_per_room >  59.43
|   |   |   |   |   |   |   |--- lead_time <= 25.50
|   |   |   |   |   |   |   |   |--- weights: [20.88, 6.07] class: 0
|   |   |   |   |   |   |   |--- lead_time >  25.50
|   |   |   |   |   |   |   |   |--- avg_price_per_room <= 71.34
|   |   |   |   |   |   |   |   |   |--- arrival_month <= 3.50
|   |   |   |   |   |   |   |   |   |   |--- lead_time <= 68.50
|   |   |   |   |   |   |   |   |   |   |   |--- weights: [15.66, 78.94] class: 1
|   |   |   |   |   |   |   |   |   |   |--- lead_time >  68.50
|   |   |   |   |   |   |   |   |   |   |   |--- truncated branch of depth 3
|   |   |   |   |   |   |   |   |   |--- arrival_month >  3.50
|   |   |   |   |   |   |   |   |   |   |--- lead_time <= 102.00
|   |   |   |   |   |   |   |   |   |   |   |--- truncated branch of depth 3
|   |   |   |   |   |   |   |   |   |   |--- lead_time >  102.00
|   |   |   |   |   |   |   |   |   |   |   |--- weights: [12.67, 3.04] class: 0
|   |   |   |   |   |   |   |   |--- avg_price_per_room >  71.34
|   |   |   |   |   |   |   |   |   |--- weights: [11.18, 0.00] class: 0
|   |   |   |   |   |--- avg_price_per_room >  71.92
|   |   |   |   |   |   |--- arrival_year <= 2017.50
|   |   |   |   |   |   |   |--- lead_time <= 65.50
|   |   |   |   |   |   |   |   |--- avg_price_per_room <= 120.45
|   |   |   |   |   |   |   |   |   |--- weights: [79.77, 9.11] class: 0
|   |   |   |   |   |   |   |   |--- avg_price_per_room >  120.45
|   |   |   |   |   |   |   |   |   |--- no_of_week_nights <= 1.50
|   |   |   |   |   |   |   |   |   |   |--- weights: [3.73, 0.00] class: 0
|   |   |   |   |   |   |   |   |   |--- no_of_week_nights >  1.50
|   |   |   |   |   |   |   |   |   |   |--- weights: [3.73, 12.14] class: 1
|   |   |   |   |   |   |   |--- lead_time >  65.50
|   |   |   |   |   |   |   |   |--- type_of_meal_plan_Meal Plan 2 <= 0.50
|   |   |   |   |   |   |   |   |   |--- arrival_date <= 27.50
|   |   |   |   |   |   |   |   |   |   |--- weights: [16.40, 47.06] class: 1
|   |   |   |   |   |   |   |   |   |--- arrival_date >  27.50
|   |   |   |   |   |   |   |   |   |   |--- weights: [3.73, 0.00] class: 0
|   |   |   |   |   |   |   |   |--- type_of_meal_plan_Meal Plan 2 >  0.50
|   |   |   |   |   |   |   |   |   |--- weights: [0.00, 63.76] class: 1
|   |   |   |   |   |   |--- arrival_year >  2017.50
|   |   |   |   |   |   |   |--- avg_price_per_room <= 104.31
|   |   |   |   |   |   |   |   |--- lead_time <= 25.50
|   |   |   |   |   |   |   |   |   |--- arrival_month <= 11.50
|   |   |   |   |   |   |   |   |   |   |--- arrival_month <= 1.50
|   |   |   |   |   |   |   |   |   |   |   |--- weights: [16.40, 0.00] class: 0
|   |   |   |   |   |   |   |   |   |   |--- arrival_month >  1.50
|   |   |   |   |   |   |   |   |   |   |   |--- weights: [38.77, 118.41] class: 1
|   |   |   |   |   |   |   |   |   |--- arrival_month >  11.50
|   |   |   |   |   |   |   |   |   |   |--- weights: [23.11, 0.00] class: 0
|   |   |   |   |   |   |   |   |--- lead_time >  25.50
|   |   |   |   |   |   |   |   |   |--- type_of_meal_plan_Not Selected <= 0.50
|   |   |   |   |   |   |   |   |   |   |--- no_of_week_nights <= 1.50
|   |   |   |   |   |   |   |   |   |   |   |--- weights: [39.51, 185.21] class: 1
|   |   |   |   |   |   |   |   |   |   |--- no_of_week_nights >  1.50
|   |   |   |   |   |   |   |   |   |   |   |--- truncated branch of depth 6
|   |   |   |   |   |   |   |   |   |--- type_of_meal_plan_Not Selected >  0.50
|   |   |   |   |   |   |   |   |   |   |--- weights: [73.81, 411.41] class: 1
|   |   |   |   |   |   |   |--- avg_price_per_room >  104.31
|   |   |   |   |   |   |   |   |--- arrival_month <= 10.50
|   |   |   |   |   |   |   |   |   |--- room_type_reserved_Room_Type 5 <= 0.50
|   |   |   |   |   |   |   |   |   |   |--- avg_price_per_room <= 195.30
|   |   |   |   |   |   |   |   |   |   |   |--- truncated branch of depth 9
|   |   |   |   |   |   |   |   |   |   |--- avg_price_per_room >  195.30
|   |   |   |   |   |   |   |   |   |   |   |--- weights: [0.75, 138.15] class: 1
|   |   |   |   |   |   |   |   |   |--- room_type_reserved_Room_Type 5 >  0.50
|   |   |   |   |   |   |   |   |   |   |--- arrival_date <= 22.50
|   |   |   |   |   |   |   |   |   |   |   |--- weights: [11.18, 6.07] class: 0
|   |   |   |   |   |   |   |   |   |   |--- arrival_date >  22.50
|   |   |   |   |   |   |   |   |   |   |   |--- weights: [0.75, 9.11] class: 1
|   |   |   |   |   |   |   |   |--- arrival_month >  10.50
|   |   |   |   |   |   |   |   |   |--- avg_price_per_room <= 168.06
|   |   |   |   |   |   |   |   |   |   |--- lead_time <= 22.00
|   |   |   |   |   |   |   |   |   |   |   |--- truncated branch of depth 2
|   |   |   |   |   |   |   |   |   |   |--- lead_time >  22.00
|   |   |   |   |   |   |   |   |   |   |   |--- weights: [17.15, 83.50] class: 1
|   |   |   |   |   |   |   |   |   |--- avg_price_per_room >  168.06
|   |   |   |   |   |   |   |   |   |   |--- weights: [12.67, 6.07] class: 0
|   |   |   |   |--- required_car_parking_space >  0.50
|   |   |   |   |   |--- weights: [48.46, 1.52] class: 0
|   |--- no_of_special_requests >  0.50
|   |   |--- no_of_special_requests <= 1.50
|   |   |   |--- market_segment_type_Online <= 0.50
|   |   |   |   |--- lead_time <= 102.50
|   |   |   |   |   |--- type_of_meal_plan_Not Selected <= 0.50
|   |   |   |   |   |   |--- weights: [697.09, 9.11] class: 0
|   |   |   |   |   |--- type_of_meal_plan_Not Selected >  0.50
|   |   |   |   |   |   |--- lead_time <= 63.00
|   |   |   |   |   |   |   |--- weights: [15.66, 1.52] class: 0
|   |   |   |   |   |   |--- lead_time >  63.00
|   |   |   |   |   |   |   |--- weights: [0.00, 7.59] class: 1
|   |   |   |   |--- lead_time >  102.50
|   |   |   |   |   |--- no_of_week_nights <= 2.50
|   |   |   |   |   |   |--- lead_time <= 105.00
|   |   |   |   |   |   |   |--- weights: [0.75, 6.07] class: 1
|   |   |   |   |   |   |--- lead_time >  105.00
|   |   |   |   |   |   |   |--- weights: [31.31, 13.66] class: 0
|   |   |   |   |   |--- no_of_week_nights >  2.50
|   |   |   |   |   |   |--- weights: [44.73, 3.04] class: 0
|   |   |   |--- market_segment_type_Online >  0.50
|   |   |   |   |--- lead_time <= 8.50
|   |   |   |   |   |--- lead_time <= 4.50
|   |   |   |   |   |   |--- no_of_week_nights <= 10.00
|   |   |   |   |   |   |   |--- weights: [498.03, 40.99] class: 0
|   |   |   |   |   |   |--- no_of_week_nights >  10.00
|   |   |   |   |   |   |   |--- weights: [0.00, 3.04] class: 1
|   |   |   |   |   |--- lead_time >  4.50
|   |   |   |   |   |   |--- arrival_date <= 13.50
|   |   |   |   |   |   |   |--- arrival_month <= 9.50
|   |   |   |   |   |   |   |   |--- weights: [58.90, 36.43] class: 0
|   |   |   |   |   |   |   |--- arrival_month >  9.50
|   |   |   |   |   |   |   |   |--- weights: [33.55, 1.52] class: 0
|   |   |   |   |   |   |--- arrival_date >  13.50
|   |   |   |   |   |   |   |--- type_of_meal_plan_Not Selected <= 0.50
|   |   |   |   |   |   |   |   |--- weights: [123.76, 9.11] class: 0
|   |   |   |   |   |   |   |--- type_of_meal_plan_Not Selected >  0.50
|   |   |   |   |   |   |   |   |--- avg_price_per_room <= 126.33
|   |   |   |   |   |   |   |   |   |--- weights: [32.80, 3.04] class: 0
|   |   |   |   |   |   |   |   |--- avg_price_per_room >  126.33
|   |   |   |   |   |   |   |   |   |--- weights: [9.69, 13.66] class: 1
|   |   |   |   |--- lead_time >  8.50
|   |   |   |   |   |--- required_car_parking_space <= 0.50
|   |   |   |   |   |   |--- avg_price_per_room <= 118.55
|   |   |   |   |   |   |   |--- lead_time <= 61.50
|   |   |   |   |   |   |   |   |--- arrival_month <= 11.50
|   |   |   |   |   |   |   |   |   |--- arrival_month <= 1.50
|   |   |   |   |   |   |   |   |   |   |--- weights: [70.08, 0.00] class: 0
|   |   |   |   |   |   |   |   |   |--- arrival_month >  1.50
|   |   |   |   |   |   |   |   |   |   |--- no_of_week_nights <= 4.50
|   |   |   |   |   |   |   |   |   |   |   |--- truncated branch of depth 11
|   |   |   |   |   |   |   |   |   |   |--- no_of_week_nights >  4.50
|   |   |   |   |   |   |   |   |   |   |   |--- truncated branch of depth 6
|   |   |   |   |   |   |   |   |--- arrival_month >  11.50
|   |   |   |   |   |   |   |   |   |--- weights: [126.74, 1.52] class: 0
|   |   |   |   |   |   |   |--- lead_time >  61.50
|   |   |   |   |   |   |   |   |--- arrival_year <= 2017.50
|   |   |   |   |   |   |   |   |   |--- arrival_month <= 7.50
|   |   |   |   |   |   |   |   |   |   |--- weights: [4.47, 57.69] class: 1
|   |   |   |   |   |   |   |   |   |--- arrival_month >  7.50
|   |   |   |   |   |   |   |   |   |   |--- lead_time <= 66.50
|   |   |   |   |   |   |   |   |   |   |   |--- weights: [5.22, 0.00] class: 0
|   |   |   |   |   |   |   |   |   |   |--- lead_time >  66.50
|   |   |   |   |   |   |   |   |   |   |   |--- truncated branch of depth 5
|   |   |   |   |   |   |   |   |--- arrival_year >  2017.50
|   |   |   |   |   |   |   |   |   |--- arrival_month <= 9.50
|   |   |   |   |   |   |   |   |   |   |--- avg_price_per_room <= 71.93
|   |   |   |   |   |   |   |   |   |   |   |--- weights: [54.43, 3.04] class: 0
|   |   |   |   |   |   |   |   |   |   |--- avg_price_per_room >  71.93
|   |   |   |   |   |   |   |   |   |   |   |--- truncated branch of depth 10
|   |   |   |   |   |   |   |   |   |--- arrival_month >  9.50
|   |   |   |   |   |   |   |   |   |   |--- no_of_week_nights <= 1.50
|   |   |   |   |   |   |   |   |   |   |   |--- truncated branch of depth 4
|   |   |   |   |   |   |   |   |   |   |--- no_of_week_nights >  1.50
|   |   |   |   |   |   |   |   |   |   |   |--- truncated branch of depth 6
|   |   |   |   |   |   |--- avg_price_per_room >  118.55
|   |   |   |   |   |   |   |--- arrival_month <= 8.50
|   |   |   |   |   |   |   |   |--- arrival_date <= 19.50
|   |   |   |   |   |   |   |   |   |--- no_of_week_nights <= 7.50
|   |   |   |   |   |   |   |   |   |   |--- avg_price_per_room <= 177.15
|   |   |   |   |   |   |   |   |   |   |   |--- truncated branch of depth 6
|   |   |   |   |   |   |   |   |   |   |--- avg_price_per_room >  177.15
|   |   |   |   |   |   |   |   |   |   |   |--- truncated branch of depth 3
|   |   |   |   |   |   |   |   |   |--- no_of_week_nights >  7.50
|   |   |   |   |   |   |   |   |   |   |--- weights: [0.00, 6.07] class: 1
|   |   |   |   |   |   |   |   |--- arrival_date >  19.50
|   |   |   |   |   |   |   |   |   |--- arrival_date <= 27.50
|   |   |   |   |   |   |   |   |   |   |--- avg_price_per_room <= 121.20
|   |   |   |   |   |   |   |   |   |   |   |--- weights: [18.64, 6.07] class: 0
|   |   |   |   |   |   |   |   |   |   |--- avg_price_per_room >  121.20
|   |   |   |   |   |   |   |   |   |   |   |--- truncated branch of depth 4
|   |   |   |   |   |   |   |   |   |--- arrival_date >  27.50
|   |   |   |   |   |   |   |   |   |   |--- lead_time <= 55.50
|   |   |   |   |   |   |   |   |   |   |   |--- truncated branch of depth 2
|   |   |   |   |   |   |   |   |   |   |--- lead_time >  55.50
|   |   |   |   |   |   |   |   |   |   |   |--- truncated branch of depth 2
|   |   |   |   |   |   |   |--- arrival_month >  8.50
|   |   |   |   |   |   |   |   |--- arrival_year <= 2017.50
|   |   |   |   |   |   |   |   |   |--- arrival_month <= 9.50
|   |   |   |   |   |   |   |   |   |   |--- weights: [11.93, 10.63] class: 0
|   |   |   |   |   |   |   |   |   |--- arrival_month >  9.50
|   |   |   |   |   |   |   |   |   |   |--- weights: [37.28, 0.00] class: 0
|   |   |   |   |   |   |   |   |--- arrival_year >  2017.50
|   |   |   |   |   |   |   |   |   |--- arrival_month <= 11.50
|   |   |   |   |   |   |   |   |   |   |--- avg_price_per_room <= 119.20
|   |   |   |   |   |   |   |   |   |   |   |--- weights: [9.69, 28.84] class: 1
|   |   |   |   |   |   |   |   |   |   |--- avg_price_per_room >  119.20
|   |   |   |   |   |   |   |   |   |   |   |--- truncated branch of depth 12
|   |   |   |   |   |   |   |   |   |--- arrival_month >  11.50
|   |   |   |   |   |   |   |   |   |   |--- lead_time <= 100.00
|   |   |   |   |   |   |   |   |   |   |   |--- weights: [49.95, 0.00] class: 0
|   |   |   |   |   |   |   |   |   |   |--- lead_time >  100.00
|   |   |   |   |   |   |   |   |   |   |   |--- weights: [0.75, 18.22] class: 1
|   |   |   |   |   |--- required_car_parking_space >  0.50
|   |   |   |   |   |   |--- weights: [134.20, 1.52] class: 0
|   |   |--- no_of_special_requests >  1.50
|   |   |   |--- lead_time <= 90.50
|   |   |   |   |--- no_of_week_nights <= 3.50
|   |   |   |   |   |--- weights: [1585.04, 0.00] class: 0
|   |   |   |   |--- no_of_week_nights >  3.50
|   |   |   |   |   |--- no_of_special_requests <= 2.50
|   |   |   |   |   |   |--- no_of_week_nights <= 9.50
|   |   |   |   |   |   |   |--- lead_time <= 6.50
|   |   |   |   |   |   |   |   |--- weights: [32.06, 0.00] class: 0
|   |   |   |   |   |   |   |--- lead_time >  6.50
|   |   |   |   |   |   |   |   |--- arrival_month <= 11.50
|   |   |   |   |   |   |   |   |   |--- arrival_date <= 5.50
|   |   |   |   |   |   |   |   |   |   |--- weights: [23.11, 1.52] class: 0
|   |   |   |   |   |   |   |   |   |--- arrival_date >  5.50
|   |   |   |   |   |   |   |   |   |   |--- avg_price_per_room <= 93.09
|   |   |   |   |   |   |   |   |   |   |   |--- truncated branch of depth 2
|   |   |   |   |   |   |   |   |   |   |--- avg_price_per_room >  93.09
|   |   |   |   |   |   |   |   |   |   |   |--- weights: [77.54, 27.33] class: 0
|   |   |   |   |   |   |   |   |--- arrival_month >  11.50
|   |   |   |   |   |   |   |   |   |--- weights: [19.38, 0.00] class: 0
|   |   |   |   |   |   |--- no_of_week_nights >  9.50
|   |   |   |   |   |   |   |--- weights: [0.00, 3.04] class: 1
|   |   |   |   |   |--- no_of_special_requests >  2.50
|   |   |   |   |   |   |--- weights: [52.19, 0.00] class: 0
|   |   |   |--- lead_time >  90.50
|   |   |   |   |--- no_of_special_requests <= 2.50
|   |   |   |   |   |--- arrival_month <= 8.50
|   |   |   |   |   |   |--- avg_price_per_room <= 202.95
|   |   |   |   |   |   |   |--- arrival_year <= 2017.50
|   |   |   |   |   |   |   |   |--- arrival_month <= 7.50
|   |   |   |   |   |   |   |   |   |--- weights: [1.49, 9.11] class: 1
|   |   |   |   |   |   |   |   |--- arrival_month >  7.50
|   |   |   |   |   |   |   |   |   |--- weights: [8.20, 3.04] class: 0
|   |   |   |   |   |   |   |--- arrival_year >  2017.50
|   |   |   |   |   |   |   |   |--- lead_time <= 150.50
|   |   |   |   |   |   |   |   |   |--- weights: [175.20, 28.84] class: 0
|   |   |   |   |   |   |   |   |--- lead_time >  150.50
|   |   |   |   |   |   |   |   |   |--- weights: [0.00, 4.55] class: 1
|   |   |   |   |   |   |--- avg_price_per_room >  202.95
|   |   |   |   |   |   |   |--- weights: [0.00, 10.63] class: 1
|   |   |   |   |   |--- arrival_month >  8.50
|   |   |   |   |   |   |--- avg_price_per_room <= 153.15
|   |   |   |   |   |   |   |--- room_type_reserved_Room_Type 2 <= 0.50
|   |   |   |   |   |   |   |   |--- avg_price_per_room <= 71.12
|   |   |   |   |   |   |   |   |   |--- weights: [3.73, 0.00] class: 0
|   |   |   |   |   |   |   |   |--- avg_price_per_room >  71.12
|   |   |   |   |   |   |   |   |   |--- avg_price_per_room <= 90.42
|   |   |   |   |   |   |   |   |   |   |--- arrival_month <= 11.50
|   |   |   |   |   |   |   |   |   |   |   |--- truncated branch of depth 3
|   |   |   |   |   |   |   |   |   |   |--- arrival_month >  11.50
|   |   |   |   |   |   |   |   |   |   |   |--- weights: [12.67, 7.59] class: 0
|   |   |   |   |   |   |   |   |   |--- avg_price_per_room >  90.42
|   |   |   |   |   |   |   |   |   |   |--- weights: [64.12, 60.72] class: 0
|   |   |   |   |   |   |   |--- room_type_reserved_Room_Type 2 >  0.50
|   |   |   |   |   |   |   |   |--- weights: [5.96, 0.00] class: 0
|   |   |   |   |   |   |--- avg_price_per_room >  153.15
|   |   |   |   |   |   |   |--- weights: [12.67, 3.04] class: 0
|   |   |   |   |--- no_of_special_requests >  2.50
|   |   |   |   |   |--- weights: [67.10, 0.00] class: 0
|--- lead_time >  151.50
|   |--- avg_price_per_room <= 100.04
|   |   |--- no_of_special_requests <= 0.50
|   |   |   |--- no_of_adults <= 1.50
|   |   |   |   |--- market_segment_type_Online <= 0.50
|   |   |   |   |   |--- lead_time <= 163.50
|   |   |   |   |   |   |--- arrival_month <= 5.00
|   |   |   |   |   |   |   |--- weights: [2.98, 0.00] class: 0
|   |   |   |   |   |   |--- arrival_month >  5.00
|   |   |   |   |   |   |   |--- weights: [0.75, 24.29] class: 1
|   |   |   |   |   |--- lead_time >  163.50
|   |   |   |   |   |   |--- lead_time <= 341.00
|   |   |   |   |   |   |   |--- lead_time <= 173.00
|   |   |   |   |   |   |   |   |--- arrival_date <= 3.50
|   |   |   |   |   |   |   |   |   |--- weights: [46.97, 9.11] class: 0
|   |   |   |   |   |   |   |   |--- arrival_date >  3.50
|   |   |   |   |   |   |   |   |   |--- no_of_weekend_nights <= 1.00
|   |   |   |   |   |   |   |   |   |   |--- weights: [0.00, 13.66] class: 1
|   |   |   |   |   |   |   |   |   |--- no_of_weekend_nights >  1.00
|   |   |   |   |   |   |   |   |   |   |--- weights: [2.24, 0.00] class: 0
|   |   |   |   |   |   |   |--- lead_time >  173.00
|   |   |   |   |   |   |   |   |--- arrival_month <= 5.50
|   |   |   |   |   |   |   |   |   |--- arrival_date <= 7.50
|   |   |   |   |   |   |   |   |   |   |--- weights: [0.00, 4.55] class: 1
|   |   |   |   |   |   |   |   |   |--- arrival_date >  7.50
|   |   |   |   |   |   |   |   |   |   |--- weights: [6.71, 0.00] class: 0
|   |   |   |   |   |   |   |   |--- arrival_month >  5.50
|   |   |   |   |   |   |   |   |   |--- weights: [188.62, 7.59] class: 0
|   |   |   |   |   |   |--- lead_time >  341.00
|   |   |   |   |   |   |   |--- weights: [13.42, 27.33] class: 1
|   |   |   |   |--- market_segment_type_Online >  0.50
|   |   |   |   |   |--- avg_price_per_room <= 2.50
|   |   |   |   |   |   |--- lead_time <= 285.50
|   |   |   |   |   |   |   |--- weights: [8.20, 0.00] class: 0
|   |   |   |   |   |   |--- lead_time >  285.50
|   |   |   |   |   |   |   |--- weights: [0.75, 3.04] class: 1
|   |   |   |   |   |--- avg_price_per_room >  2.50
|   |   |   |   |   |   |--- weights: [0.75, 97.16] class: 1
|   |   |   |--- no_of_adults >  1.50
|   |   |   |   |--- avg_price_per_room <= 82.47
|   |   |   |   |   |--- market_segment_type_Offline <= 0.50
|   |   |   |   |   |   |--- weights: [2.98, 282.37] class: 1
|   |   |   |   |   |--- market_segment_type_Offline >  0.50
|   |   |   |   |   |   |--- arrival_month <= 11.50
|   |   |   |   |   |   |   |--- lead_time <= 244.00
|   |   |   |   |   |   |   |   |--- no_of_week_nights <= 1.50
|   |   |   |   |   |   |   |   |   |--- no_of_weekend_nights <= 1.50
|   |   |   |   |   |   |   |   |   |   |--- lead_time <= 166.50
|   |   |   |   |   |   |   |   |   |   |   |--- weights: [2.24, 0.00] class: 0
|   |   |   |   |   |   |   |   |   |   |--- lead_time >  166.50
|   |   |   |   |   |   |   |   |   |   |   |--- weights: [2.24, 57.69] class: 1
|   |   |   |   |   |   |   |   |   |--- no_of_weekend_nights >  1.50
|   |   |   |   |   |   |   |   |   |   |--- weights: [17.89, 0.00] class: 0
|   |   |   |   |   |   |   |   |--- no_of_week_nights >  1.50
|   |   |   |   |   |   |   |   |   |--- no_of_weekend_nights <= 0.50
|   |   |   |   |   |   |   |   |   |   |--- arrival_month <= 9.50
|   |   |   |   |   |   |   |   |   |   |   |--- weights: [11.18, 3.04] class: 0
|   |   |   |   |   |   |   |   |   |   |--- arrival_month >  9.50
|   |   |   |   |   |   |   |   |   |   |   |--- weights: [0.00, 12.14] class: 1
|   |   |   |   |   |   |   |   |   |--- no_of_weekend_nights >  0.50
|   |   |   |   |   |   |   |   |   |   |--- weights: [75.30, 12.14] class: 0
|   |   |   |   |   |   |   |--- lead_time >  244.00
|   |   |   |   |   |   |   |   |--- arrival_year <= 2017.50
|   |   |   |   |   |   |   |   |   |--- weights: [25.35, 0.00] class: 0
|   |   |   |   |   |   |   |   |--- arrival_year >  2017.50
|   |   |   |   |   |   |   |   |   |--- avg_price_per_room <= 80.38
|   |   |   |   |   |   |   |   |   |   |--- no_of_week_nights <= 3.50
|   |   |   |   |   |   |   |   |   |   |   |--- weights: [11.18, 264.15] class: 1
|   |   |   |   |   |   |   |   |   |   |--- no_of_week_nights >  3.50
|   |   |   |   |   |   |   |   |   |   |   |--- truncated branch of depth 3
|   |   |   |   |   |   |   |   |   |--- avg_price_per_room >  80.38
|   |   |   |   |   |   |   |   |   |   |--- weights: [7.46, 0.00] class: 0
|   |   |   |   |   |   |--- arrival_month >  11.50
|   |   |   |   |   |   |   |--- weights: [46.22, 0.00] class: 0
|   |   |   |   |--- avg_price_per_room >  82.47
|   |   |   |   |   |--- no_of_adults <= 2.50
|   |   |   |   |   |   |--- lead_time <= 324.50
|   |   |   |   |   |   |   |--- arrival_month <= 11.50
|   |   |   |   |   |   |   |   |--- room_type_reserved_Room_Type 4 <= 0.50
|   |   |   |   |   |   |   |   |   |--- weights: [7.46, 986.78] class: 1
|   |   |   |   |   |   |   |   |--- room_type_reserved_Room_Type 4 >  0.50
|   |   |   |   |   |   |   |   |   |--- market_segment_type_Offline <= 0.50
|   |   |   |   |   |   |   |   |   |   |--- weights: [0.00, 10.63] class: 1
|   |   |   |   |   |   |   |   |   |--- market_segment_type_Offline >  0.50
|   |   |   |   |   |   |   |   |   |   |--- weights: [4.47, 0.00] class: 0
|   |   |   |   |   |   |   |--- arrival_month >  11.50
|   |   |   |   |   |   |   |   |--- market_segment_type_Offline <= 0.50
|   |   |   |   |   |   |   |   |   |--- weights: [0.00, 19.74] class: 1
|   |   |   |   |   |   |   |   |--- market_segment_type_Offline >  0.50
|   |   |   |   |   |   |   |   |   |--- weights: [5.22, 0.00] class: 0
|   |   |   |   |   |   |--- lead_time >  324.50
|   |   |   |   |   |   |   |--- avg_price_per_room <= 89.00
|   |   |   |   |   |   |   |   |--- weights: [5.96, 0.00] class: 0
|   |   |   |   |   |   |   |--- avg_price_per_room >  89.00
|   |   |   |   |   |   |   |   |--- weights: [0.75, 13.66] class: 1
|   |   |   |   |   |--- no_of_adults >  2.50
|   |   |   |   |   |   |--- weights: [5.22, 0.00] class: 0
|   |   |--- no_of_special_requests >  0.50
|   |   |   |--- no_of_weekend_nights <= 0.50
|   |   |   |   |--- lead_time <= 180.50
|   |   |   |   |   |--- lead_time <= 159.50
|   |   |   |   |   |   |--- arrival_month <= 8.50
|   |   |   |   |   |   |   |--- weights: [5.96, 0.00] class: 0
|   |   |   |   |   |   |--- arrival_month >  8.50
|   |   |   |   |   |   |   |--- weights: [1.49, 7.59] class: 1
|   |   |   |   |   |--- lead_time >  159.50
|   |   |   |   |   |   |--- arrival_date <= 1.50
|   |   |   |   |   |   |   |--- weights: [1.49, 3.04] class: 1
|   |   |   |   |   |   |--- arrival_date >  1.50
|   |   |   |   |   |   |   |--- weights: [35.79, 1.52] class: 0
|   |   |   |   |--- lead_time >  180.50
|   |   |   |   |   |--- no_of_special_requests <= 2.50
|   |   |   |   |   |   |--- market_segment_type_Online <= 0.50
|   |   |   |   |   |   |   |--- no_of_adults <= 2.50
|   |   |   |   |   |   |   |   |--- weights: [12.67, 3.04] class: 0
|   |   |   |   |   |   |   |--- no_of_adults >  2.50
|   |   |   |   |   |   |   |   |--- weights: [0.00, 3.04] class: 1
|   |   |   |   |   |   |--- market_segment_type_Online >  0.50
|   |   |   |   |   |   |   |--- weights: [7.46, 206.46] class: 1
|   |   |   |   |   |--- no_of_special_requests >  2.50
|   |   |   |   |   |   |--- weights: [8.95, 0.00] class: 0
|   |   |   |--- no_of_weekend_nights >  0.50
|   |   |   |   |--- market_segment_type_Offline <= 0.50
|   |   |   |   |   |--- arrival_month <= 11.50
|   |   |   |   |   |   |--- avg_price_per_room <= 76.48
|   |   |   |   |   |   |   |--- weights: [46.97, 4.55] class: 0
|   |   |   |   |   |   |--- avg_price_per_room >  76.48
|   |   |   |   |   |   |   |--- no_of_week_nights <= 6.50
|   |   |   |   |   |   |   |   |--- arrival_date <= 27.50
|   |   |   |   |   |   |   |   |   |--- lead_time <= 233.00
|   |   |   |   |   |   |   |   |   |   |--- lead_time <= 152.50
|   |   |   |   |   |   |   |   |   |   |   |--- weights: [1.49, 4.55] class: 1
|   |   |   |   |   |   |   |   |   |   |--- lead_time >  152.50
|   |   |   |   |   |   |   |   |   |   |   |--- truncated branch of depth 3
|   |   |   |   |   |   |   |   |   |--- lead_time >  233.00
|   |   |   |   |   |   |   |   |   |   |--- weights: [23.11, 19.74] class: 0
|   |   |   |   |   |   |   |   |--- arrival_date >  27.50
|   |   |   |   |   |   |   |   |   |--- no_of_week_nights <= 1.50
|   |   |   |   |   |   |   |   |   |   |--- weights: [2.24, 15.18] class: 1
|   |   |   |   |   |   |   |   |   |--- no_of_week_nights >  1.50
|   |   |   |   |   |   |   |   |   |   |--- lead_time <= 269.00
|   |   |   |   |   |   |   |   |   |   |   |--- truncated branch of depth 3
|   |   |   |   |   |   |   |   |   |   |--- lead_time >  269.00
|   |   |   |   |   |   |   |   |   |   |   |--- weights: [0.00, 4.55] class: 1
|   |   |   |   |   |   |   |--- no_of_week_nights >  6.50
|   |   |   |   |   |   |   |   |--- weights: [4.47, 13.66] class: 1
|   |   |   |   |   |--- arrival_month >  11.50
|   |   |   |   |   |   |--- arrival_date <= 14.50
|   |   |   |   |   |   |   |--- weights: [8.20, 3.04] class: 0
|   |   |   |   |   |   |--- arrival_date >  14.50
|   |   |   |   |   |   |   |--- weights: [11.18, 31.88] class: 1
|   |   |   |   |--- market_segment_type_Offline >  0.50
|   |   |   |   |   |--- lead_time <= 348.50
|   |   |   |   |   |   |--- weights: [106.61, 3.04] class: 0
|   |   |   |   |   |--- lead_time >  348.50
|   |   |   |   |   |   |--- weights: [5.96, 4.55] class: 0
|   |--- avg_price_per_room >  100.04
|   |   |--- arrival_month <= 11.50
|   |   |   |--- no_of_special_requests <= 2.50
|   |   |   |   |--- weights: [0.00, 3200.19] class: 1
|   |   |   |--- no_of_special_requests >  2.50
|   |   |   |   |--- weights: [23.11, 0.00] class: 0
|   |   |--- arrival_month >  11.50
|   |   |   |--- no_of_special_requests <= 0.50
|   |   |   |   |--- weights: [35.04, 0.00] class: 0
|   |   |   |--- no_of_special_requests >  0.50
|   |   |   |   |--- arrival_date <= 24.50
|   |   |   |   |   |--- weights: [3.73, 0.00] class: 0
|   |   |   |   |--- arrival_date >  24.50
|   |   |   |   |   |--- weights: [3.73, 22.77] class: 1

In [157]:
importances = best_model.feature_importances_
indices = np.argsort(importances)

plt.figure(figsize=(12, 12))
plt.title("Feature Importances")
plt.barh(range(len(indices)), importances[indices], color="violet", align="center")
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel("Relative Importance")
plt.show()
No description has been provided for this image

Observations

Lead time is still the most important predictor of booking cancellation.

Model Performance Comparison and Conclusions¶

In [158]:
#Training performance comparison
models_train_comp_df = pd.concat(
    [
      decision_tree_perf_train_default.T,
       decision_tree_tune_perf_train.T,
        decision_tree_post_perf_train.T,
    ],
    axis=1,
)
models_train_comp_df.columns = [
    "Decision Tree sklearn",
    "Decision Tree (Pre-Pruning)",
    "Decision Tree (Post-Pruning)",
]
print("Training performance comparison:")
models_train_comp_df
Training performance comparison:
Out[158]:
Decision Tree sklearn Decision Tree (Pre-Pruning) Decision Tree (Post-Pruning)
Accuracy 0.99421 0.83097 0.89954
Recall 0.98661 0.78608 0.90303
Precision 0.99578 0.72425 0.81274
F1 0.99117 0.75390 0.85551
In [159]:
#Testing performance comparison
models_test_comp_df = pd.concat(
[
    decision_tree_perf_test_default.T,
    decision_tree_tune_perf_test.T,
    decision_tree_test.T,

],
axis=1,)

models_test_comp_df.columns = [
    "Decision Tree sklearn",
    "Decision Tree (Pre-Pruning)",
    "Decision Tree (Post-Pruning)",
]
print("Testing performance comparison:")
models_test_comp_df
Testing performance comparison:
Out[159]:
Decision Tree sklearn Decision Tree (Pre-Pruning) Decision Tree (Post-Pruning)
Accuracy 0.87118 0.83497 0.89954
Recall 0.81175 0.78336 0.90303
Precision 0.79461 0.72758 0.81274
F1 0.80309 0.75444 0.85551

Observations

  • The default decision tree was overfitting. The training set had an F1 Score(almost a perfect score) which was higher than the testing set.
  • Pre-pruning removed the overfitting. The cost complexity pruning (post-pruning) resulted in the best F1 Score for the decision tree.

Actionable Insights and Recommendations¶

  • What profitable policies for cancellations and refunds can the hotel adopt?
  • What other recommedations would you suggest to the hotel?

Actionable Insights¶

  • No missing values and duplicated values identified in the dataset. That attests to the integrity of the data and its readiness for exploratory analysis. We should continue to ensure data integrity for effective analysis.
  • The presence of outliers in the dataset indicates that some bookings are incredibly long in lead times as far as over 300 days. Shorter lead times should be preferred.
  • Eliminating overfitting in the data was necessary before modeling.
  • The month of October appeared to be the hotel's busiest month.
  • Online segment has a highest cancellation rate. This is followed by offline. INN Hotels Group gets most of its business from these segments.
  • There's high probability that more cancellations would come from online and offline market segments. Steps shoud be taken to reduce cancellations in order to improve revenue, profitability, and resource allocation.
  • Some relationship can be established between booking status, lead time, and market segment type. Lead time is the most important predictor of customer default.
  • INN Hotels Group could build its marketing and resource allocation strategy around the different seasons of the year to improve cost management and profitability.

Recommendations¶

  • Undertake Customer Profitability Analysis for each market segment type to determine which segment is profitable and which one is not. This will enable the hotel to analyse the resources used in serving specific customers and compare these reources to the revenues generated from these customers.
  • INN Hotels should nurture its relationship with their key customers, understand the cost of serving them so the hotel can meet their expectations in a cost-effective manner.
  • INN Hotels should decide on a fair deadline for free cancellation. An appropriate fee should be charged after the free deadline.
  • A No-Show policy should be developed and communicated clearly to the customer at the time of booking. A full rate could be charged on the first night or a different penalty could be considered.
  • Demand full payment for the reservation
  • A very important thing to include in INN Hotels cancellation poicy is the clause about force majeure. Force majeure refers to acts of God, political unrest, or a global pandemic like COVID. By making it possible with such a clause, INN Hotels may allow flexibility in cancellation with empathy to any guest who is affected by any event beyond their control.
  • Build a Machine Learning-based solution to help in predicting booking cancellations.
  • INN Hotels should have a refund policy which should specify the conditions and the method of refund. Is it a full refund, partial, or credit towards a future stay? These terms should be transparent and fair to all guests.
  • In low-demand periods, strategic discounts or promotions may be deployed to attract more guests.